Information Theory and Statistical Learning
Frank Emmert-Streib
and Machine Learning
Center for Cancer Research and Cell Biology
School of Biomedical Sciences
97 Lisburn Road, Belfast BT9 7BL, UK
v@bio-complexity.com
Matthias Dehmer
Vienna University of Technology
Institute of Discrete Mathematics and Geometry
Wiedner Hauptstr. 8–10
1040 Vienna, Austria
and
University of Coimbra
Center for Mathematics, Probability and Statistics
Apartado 3008, 3001–454 Coimbra, Portugal
matthias@dehmer.org
ISBN: 978-0-387-84815-0 e-ISBN: 978-0-387-84816-7
DOI: 10.1007/978-0-387-84816-7
Library of Congress Control Number: 2008932107
© Springer Science+Business Media, LLC 2009
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
springer.com
This book presents theoretical and practical results of information theoretic methods used in the context of statistical learning. Its major goal is to advocate and promote the importance and usefulness of information theoretic concepts for understanding and developing the sophisticated machine learning methods necessary not only to cope with the challenges of modern data analysis but also to gain further insights into their theoretical foundations. Here Statistical Learning is loosely defined as a synonym for, e.g., Applied Statistics, Artificial Intelligence or Machine Learning. Over the last decades, many approaches and algorithms have been suggested in the fields mentioned above, for which information theoretic concepts constitute core ingredients. For this reason we present a selected collection of some of the finest concepts and applications thereof from the perspective of information theory as the underlying guiding principles. We consider such a perspective as very insightful and expect an even greater appreciation for this perspective over the next years.

The book is intended for interdisciplinary use, ranging from Applied Statistics, Artificial Intelligence, Applied Discrete Mathematics, Computer Science, Information Theory, Machine Learning to Physics. In addition, people working in the hybrid fields of Bioinformatics, Biostatistics, Computational Biology, Computational Linguistics, Medical Bioinformatics, Neuroinformatics or Web Mining might profit tremendously from the presented results, because these data-driven areas are in permanent need of new approaches to cope with the increasing flood of high-dimensional, noisy data that possess seemingly never ending challenges for their analysis.

Many colleagues, whether consciously or unconsciously, have provided us with input, help and support before and during the writing of this book. In particular we would like to thank Shun-ichi Amari, Hamid Arabnia, Gökhan Bakır, Alexandru T. Balaban, Teodor Silviu Balaban, Frank J. Balbach, João Barros, Igor Bass, Matthias Beck, Danail Bonchev, Stefan Borgert, Mieczyslaw Borowiecki, Rudi L. Cilibrasi, Mike Coleman, Malcolm Cook, Pham Dinh-Tuan, Michael Drmota, Shinto Eguchi, B. Roy Frieden, Bernhard Gittenberger, Galina Glazko, Martin Grabner, Earl Glynn, Peter Grassberger, Peter Hamilton, Kateřina Hlaváčková-Schindler, Lucas R. Hope, Jinjie Huang, Robert Jenssen, Attila Kertész-Farkas, András Kocsor,
Elena Konstantinova, Kevin B. Korb, Alexander Kraskov, Tyll Krüger, Ming Li, J.F. McCann, Alexander Mehler, Marco Möller, Abbe Mowshowitz, Max Mühlhäuser, Markus Müller, Noboru Murata, Arcady Mushegian, Erik P. Nyberg, Paulo Eduardo Oliveira, Hyeyoung Park, Judea Pearl, Daniel Polani, Sándor Pongor, William Reeves, Jorma Rissanen, Panxiang Rong, Reuven Rubinstein, Rainer Siegmund-Schulze, Heinz Georg Schuster, Helmut Schwegler, Chris Seidel, Fred Sobik, Ray J. Solomonoff, Doru Stefanescu, Thomas Stoll, John Storey, Milan Studeny, Ulrich Tamm, Naftali Tishby, Paul M.B. Vitányi, José Miguel Urbano, Kazuho Watanabe, Dongxiao Zhu, Vadim Zverovich, and apologize to all those who have been missed inadvertently. We would also like to thank our editor Amy Brais from Springer, who has always been available and helpful. Last but not least we would like to thank our families for support and encouragement during all the time of preparing the book for publication.

We hope this book will help to spread the enthusiasm we have for this field and inspire people to tackle their own practical or theoretical research problems.
Belfast and Coimbra Frank Emmert-Streib
Contents

1 Algorithmic Probability: Theory and Applications ..... 1
Ray J. Solomonoff

2 Model Selection and Testing by the MDL Principle ..... 25
Jorma Rissanen

3 Normalized Information Distance ..... 45
Paul M.B. Vitányi, Frank J. Balbach, Rudi L. Cilibrasi, and Ming Li

4 The Application of Data Compression-Based Distances to Biological Sequences ..... 83
Attila Kertész-Farkas, András Kocsor, and Sándor Pongor

5 MIC: Mutual Information Based Hierarchical Clustering ..... 101
Alexander Kraskov and Peter Grassberger

6 A Hybrid Genetic Algorithm for Feature Selection Based on Mutual Information ..... 125
Jinjie Huang and Panxiang Rong

7 Information Approach to Blind Source Separation and Deconvolution ..... 153
Pham Dinh-Tuan

8 Causality in Time Series: Its Detection and Quantification by Means of Information Theory ..... 183
Kateřina Hlaváčková-Schindler

9 Information Theoretic Learning and Kernel Methods ..... 209
Robert Jenssen

10 Information-Theoretic Causal Power ..... 231
Kevin B. Korb, Lucas R. Hope, and Erik P. Nyberg

14 Model Selection and Information Criterion ..... 333
Noboru Murata and Hyeyoung Park

15 Extreme Physical Information as a Principle of Universal Stability ..... 355
B. Roy Frieden

16 Entropy and Cloning Methods for Combinatorial Optimization, Sampling and Counting Using the Gibbs Sampler ..... 385
Reuven Rubinstein

Index ..... 435
Contributors

Frank J. Balbach, University of Waterloo, Waterloo, ON, Canada

Lucas R. Hope, Bayesian Intelligence Pty Ltd., lhope@bayesian-intelligence.com

Kateřina Hlaváčková-Schindler, Commission for Scientific Visualization, Austrian Academy of Sciences, Donau-City Str. 1, 1220 Vienna, Austria and Institute of Information Theory and Automation of the Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 4, 18208 Praha 8, Czech Republic, katerina.schindler@assoc.oeaw.ac.at

Jinjie Huang, Department of Automation, Harbin University of Science and Technology, Xuefu Road 52, Harbin 150080, China, jinjiehyh@yahoo.com.cn

Robert Jenssen, Department of Physics and Technology, University of Tromsø, 9037 Tromsø, Norway, robert.jenssen@phys.uit.no
Attila Kertész-Farkas, Research Group on Artificial Intelligence, Aradi vértanúk tere 1, 6720 Szeged, Hungary, kfa@inf.u-szeged.hu

András Kocsor, Research Group on Artificial Intelligence, Aradi vértanúk tere 1, 6720 Szeged, Hungary, kocsor@inf.u-szeged.hu

Kevin B. Korb, Clayton School of IT, Monash University, Clayton 3600, Australia, kevin.korb@infotech.monash.edu.au

Alexander Kraskov, UCL Institute of Neurology, Queen Square, London WC1N 3BG, UK, akraskov@ion.ucl.ac.uk

Ming Li, University of Waterloo, Waterloo, ON, Canada, mli@uwaterloo.ca

Marco Möller, Adaptive Systems Research Group, School of Computer Science, University of Hertfordshire, Hatfield, UK, XXX@herts.ac.uk
Noboru Murata, Waseda University, Tokyo 169-8555, Japan

Jorma Rissanen, Helsinki Institute for Information Technology, Technical Universities of Tampere and Helsinki, and CLRC, Royal Holloway, University of London, London, UK, jorma.rissanen@hiit.fi

Panxiang Rong, Department of Automation, Harbin University of Science and Technology, Xuefu Road 52, Harbin 150080, China, pxrong@hrbust.edu.cn

Reuven Rubinstein, Faculty of Industrial Engineering and Management, Technion, Israel Institute of Technology, Haifa 32000, Israel
Algorithmic Probability: Theory and Applications

Ray J. Solomonoff
Abstract We first define Algorithmic Probability, an extremely powerful method of inductive inference. We discuss its completeness, incomputability, diversity and subjectivity, and show that its incomputability in no way inhibits its use for practical prediction. Applications to Bernoulli sequence prediction and grammar discovery are described. We conclude with a note on its employment in a very strong AI system for very general problem solving.
1.1 Introduction
Ever since probability was invented, there has been much controversy as to just what it meant, how it should be defined, and above all, what is the best way to predict the future from the known past. Algorithmic Probability is a relatively recent definition of probability that attempts to solve these problems.
We begin with a simple discussion of prediction and its relationship to probability. This soon leads to a definition of Algorithmic Probability (ALP) and its properties. The best-known properties of ALP are its incomputability and its completeness (in that order). Completeness means that if there is any regularity (i.e. property useful for prediction) in a batch of data, ALP will eventually find it, using a surprisingly small amount of data. The incomputability means that in the search for regularities, at no point can we make a useful estimate of how close we are to finding the most important ones. We will show, however, that this incomputability is of a very benign kind, so that in no way does it inhibit the use of ALP for good prediction. One of the important properties of ALP is subjectivity, the amount of personal experiential information that the statistician must put into the system. We will show that this
R.J. Solomonoff
Visiting Professor, Computer Learning Research Centre, Royal Holloway, University of London, London, UK
http://world.std.com/ rjs, e-mail: rjsolo@ieee.org
F. Emmert-Streib, M. Dehmer (eds.), Information Theory and Statistical Learning,
DOI: 10.1007/978-0-387-84816-7_1, © Springer Science+Business Media, LLC 2009
is a desirable feature of ALP, rather than a "Bug". Another property of ALP is its diversity – it affords many explanations of data, giving very good understanding of that data.

There have been a few derivatives of Algorithmic Probability – Minimum Message Length (MML), Minimum Description Length (MDL) and Stochastic Complexity – which merit comparison with ALP.
We will discuss the application of ALP to two kinds of problems: Prediction of the Bernoulli Sequence and Discovery of the Grammars of Context Free Languages. We also show how a variation of Levin's search procedure can be used to search over a function space very efficiently to find good predictive models.

The final section is on the future of ALP – some open problems in its application to AI and what we can expect from these applications.
1.2 Prediction, Probability and Induction
What is Prediction?
"An estimate of what is to occur in the future" – but also necessary is a measure of confidence in the prediction. As a negative example, consider an early AI program called "Prospector". It was given the characteristics of a plot of land and was expected to suggest places to drill for oil. While it did indeed do that, it soon became clear that without having any estimate of confidence, it is impossible to know whether it is economically feasible to spend $100,000 for an exploratory drill rig. Probability is one way to express this confidence.

Say the program estimated probabilities of 0.1 for a 1,000-gallon yield, 0.1 for a 10,000-gallon yield and 0.1 for a 100,000-gallon yield. The expected yield would be 0.1 × 1,000 + 0.1 × 10,000 + 0.1 × 100,000 = 11,100 gallons. At $100 per gallon this would give $1,110,000. Subtracting out the $100,000 for the drill rig gives an expected profit of $1,010,000, so it would be worth drilling at that point. The moral is that predictions by themselves are usually of little value – it is necessary to have confidence levels associated with the predictions.
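The expected-value arithmetic above can be checked mechanically. A minimal sketch; the probabilities, yields, $100/gallon price and $100,000 rig cost are the ones stated in the text:

```python
# Expected yield and profit for the drilling example above.
outcomes = [(0.1, 1_000), (0.1, 10_000), (0.1, 100_000)]  # (probability, gallons)

expected_yield = sum(p * gallons for p, gallons in outcomes)   # 11,100 gallons
expected_revenue = 100 * expected_yield                        # $100 per gallon
expected_profit = expected_revenue - 100_000                   # minus the drill rig

print(round(expected_yield), round(expected_profit))  # 11100 1010000
```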
A strong motivation for revising classical concepts of probability has come from the analysis of human problem solving. When working on a difficult problem, a person is in a maze in which he must make choices of possible courses of action. If the problem is a familiar one, the choices will all be easy. If it is not familiar, there can be much uncertainty in each choice, but choices must somehow be made. One basis for choices might be the probability of each choice leading to a quick solution – this probability being based on experience in this problem and in problems like it. A good reason for using probability is that it enables us to use Levin's Search Technique (Sect. 1.11) to find the solution in near minimal time.
The usual method of calculating probability is by taking the ratio of the number of favorable choices to the total number of choices in the past. If the decision to use integration by parts in an integration problem has been successful in the past 43% of the time, then its present probability of success is about 0.43. This method has very poor accuracy if we only have one or two cases in the past, and is undefined if the case has never occurred before. Unfortunately it is just these situations that occur most often in problem solving.
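One classical remedy for the undefined and near-zero-count cases is Laplace's rule of succession, which adds one fictitious success and one fictitious failure to the counts before taking the ratio. A small sketch; the function names are ours, not the chapter's:

```python
def ratio_estimate(successes, trials):
    """The naive frequency estimate discussed above: undefined when trials == 0."""
    return successes / trials            # raises ZeroDivisionError with no past cases

def laplace_estimate(successes, trials):
    """Laplace's rule of succession: (successes + 1) / (trials + 2).
    Defined even with no data, and never exactly 0 or 1."""
    return (successes + 1) / (trials + 2)

print(laplace_estimate(43, 100))  # integration-by-parts example: about 0.431
print(laplace_estimate(0, 2))     # two failure-free trials -> 0.25, not 0
print(laplace_estimate(0, 0))     # no past cases at all -> 0.5
```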
On a very practical level: if we cross a particular street 10 times and we get hit by a car twice, we might estimate that the probability of getting hit in crossing that street is about 0.2 = 2/10. However, if instead, we only crossed that street twice and we didn't get hit either time, it would be unreasonable to conclude that our probability of getting hit was zero! By seriously revising our definition of probability, we are able to resolve this difficulty and clear up many others that have plagued classical concepts of probability.

What is Induction?
Prediction is usually done by finding inductive models. These are deterministic or probabilistic rules for prediction. We are given a batch of data – typically a series of zeros and ones – and we are asked to predict any one of the data points as a function of the data points that precede it.

In the simplest case, let us suppose that the data has a very simple structure:

0101010101010

In this case, a good inductive rule is "zero is always followed by one; one is always followed by zero". This is an example of deterministic induction, and deterministic prediction. In this case it is 100% correct every time!
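The alternating-sequence rule can be written out directly; a minimal sketch (the function name is ours):

```python
def predict_next(history):
    """Deterministic rule induced from '0101...': zero is always followed
    by one; one is always followed by zero."""
    return '1' if history[-1] == '0' else '0'

data = "0101010101010"
hits = sum(predict_next(data[:i]) == data[i] for i in range(1, len(data)))
print(hits, "of", len(data) - 1)  # 12 of 12: correct every time
```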
There is, however, a common kind of induction problem in which our predictions will not be that reliable. Suppose we are given a sequence of zeros and ones with very little apparent structure. The only apparent regularity is that zero occurs 70% of the time and one appears 30% of the time. Inductive algorithms give a probability for each symbol in a sequence that is a function of any or none of the previous symbols. In the present case, the algorithm is very simple and the probability of the next symbol is independent of the past – the probability of zero seems to be 0.7; the probability of one seems to be 0.3. This kind of simple probabilistic sequence is called a "Bernoulli sequence". The sequence can contain many different kinds of symbols, but the probability of each is independent of the past. In Sect. 1.9 we will discuss the Bernoulli sequence in some detail.

In general we will not always be predicting Bernoulli sequences and there are many possible algorithms (which we will call "models") that tell how to assign a probability to each symbol, based on the past. Which of these should we use? Which will give good predictions in the future?
One desirable feature of an inductive model is that if it is applied to the known sequence, it produces good predictions. Suppose R_i is an inductive algorithm. R_i predicts the probability of a symbol a_j in a sequence a_1, a_2 ··· a_n by looking at the previous symbols. More exactly,

p_j = R_i(a_j | a_1, a_2 ··· a_{j−1})

a_j is the symbol for which we want the probability; a_1, a_2 ··· a_{j−1} are the previous symbols in the sequence. Then R_i is able to give the probability of a particular value of a_j as a function of the past. Here, the values of a_j can range over the entire "alphabet" of symbols that occur in the sequence. If the sequence is binary, a_j will range over the set 0 and 1 only. If the sequence is English text, a_j will range over all alphabetic and punctuation symbols. If R_i is a good predictor, for most of the a_j, the probability it assigns to them will be large – near one.
Consider S, the product of the probabilities that R_i assigns to the individual symbols of the sequence, a_1, a_2 ··· a_n:

S = ∏_{j=1}^{n} p_j     (1.1)

S will give the probability that R_i assigns to the sequence as a whole. For good prediction we want S as large as possible. The maximum value it can have is one, which implies perfect prediction. The smallest value it can have is zero – which can occur if one or more of the p_j are zero – meaning that the algorithm predicted an event to be impossible, yet that event occurred!
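Computing the product S in practice is best done in the log domain, since a product of many probabilities underflows floating-point arithmetic quickly. A sketch, with a hypothetical memoryless model standing in for R_i:

```python
import math

def sequence_probability(model, seq):
    """S: the product of the conditional probabilities the model assigns
    to each symbol of seq, accumulated as a sum of logs to avoid underflow."""
    log2_s = 0.0
    for j, symbol in enumerate(seq):
        p = model(seq[:j], symbol)        # R_i(a_j | a_1 ... a_{j-1})
        if p == 0.0:
            return 0.0                    # the model called an observed event impossible
        log2_s += math.log2(p)
    return 2.0 ** log2_s

# Hypothetical memoryless model: 0.7 for '0', 0.3 for '1', regardless of the past.
bernoulli = lambda past, symbol: 0.7 if symbol == '0' else 0.3
print(sequence_probability(bernoulli, "0001000110"))  # close to 0.7**7 * 0.3**3
```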
The "Maximum Likelihood" method of model selection uses S only to decide upon a model. First, a set of models is chosen by the statistician, based on his experience with the kind of prediction being done. The model within that set having maximum S value is selected.

Maximum Likelihood is very good when there is a lot of data – which is the area in which classical statistics operates. When there is only a small amount of data, it is necessary to consider not only S, but also the effect of the likelihood of the model itself on model selection. The next section will show how this may be done.
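Maximum Likelihood selection over a small hand-chosen model set can be sketched as follows; the candidate set and data below are illustrative, not from the text:

```python
def bernoulli_s(p_zero, seq):
    """S for a memoryless model that assigns p_zero to '0' and 1 - p_zero to '1'."""
    s = 1.0
    for symbol in seq:
        s *= p_zero if symbol == '0' else 1.0 - p_zero
    return s

data = "0001000110"                # 7 zeros, 3 ones
candidates = (0.3, 0.5, 0.7, 0.9)  # the statistician's chosen model set
best = max(candidates, key=lambda p: bernoulli_s(p, data))
print(best)  # 0.7 -- the model whose S on the known data is largest
```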
1.3 Compression and ALP
An important application of symbol prediction is text compression. If an induction algorithm assigns a probability S to a text, there is a coding method – Arithmetic Coding – that can re-create the entire text without error using just −log2 S bits. More exactly: suppose x is a string of English text, in which each character is represented by an 8-bit ASCII code, and there are n characters in x. x would be directly represented by a code of just 8n bits. If we had a prediction model, R, that assigned a probability of S to the text, then it is possible to write a sequence of just −log2 S bits, so that the original text, x, can be recovered from that bit sequence without error.

If R is a string of symbols (usually a computer program) that describes the prediction model, we will use |R| to represent the length of the shortest binary sequence that describes R. If S > 0, then the code for the text will be in two parts: the first part is the code for R, which is |R| bits long, and the second part is the code for the probability of the data, as given by R – it will be just −log2 S bits in length. The sum of these will be |R| − log2 S bits. We can use |R| − log2 S, the length of the compressed code, as a "figure of merit" of a particular induction algorithm with respect to a particular text.

We want an algorithm that will give good prediction, i.e. large S, and small |R|, so |R| − log2 S, the figure of merit, will be as small as possible and the probability it assigns to the text will be as large as possible. Models with |R| larger than optimum are considered to be overfitted. Models in which |R| are smaller than optimum are considered to be underfitted. By choosing a model that minimizes |R| − log2 S, we avoid both underfitting and overfitting, and obtain very good predictions. We will return to this topic later, when we tell how to compute |R| and S for particular models and data sets.
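The figure of merit |R| − log2 S can be compared across models directly; the bit counts and S values below are invented for illustration:

```python
import math

def figure_of_merit(model_bits, s):
    """|R| - log2(S): total compressed length in bits -- the model's own
    description plus the data coded with the model's probabilities."""
    return model_bits - math.log2(s)

# Three hypothetical models of the same text: (|R| in bits, S assigned to the text).
models = {"underfit": (10, 2.0**-60), "balanced": (40, 2.0**-20), "overfit": (90, 2.0**-5)}
for name, (bits, s) in models.items():
    print(name, figure_of_merit(bits, s))   # 70.0, 60.0, 95.0 bits
# "balanced" minimizes |R| - log2 S, so it is the preferred model.
```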
Usually there are many inductive models available. In 1960, I described Algorithmic Probability – ALP [5–7], which uses all possible models in parallel for prediction, with weights dependent upon the figure of merit of each model:

P_M(a_{n+1} | a_1, a_2 ··· a_n) = ∑_i 2^{−|R_i|} S_i R_i(a_{n+1} | a_1, a_2 ··· a_n)     (1.2)

P_M(a_{n+1} | a_1, a_2 ··· a_n) is the probability assigned by ALP to the (n + 1)th symbol of the sequence, in view of the previous part of the sequence.

R_i(a_{n+1} | a_1, a_2 ··· a_n) is the probability assigned by the ith model to the (n + 1)th symbol of the sequence, in view of the previous part of the sequence.

S_i is the probability assigned by R_i (the ith model) to the known sequence, a_1, a_2 ··· a_n, via (1.1).

2^{−|R_i|} S_i is 1/2 raised to a power equal to the figure of merit that R_i has with respect to the data string a_1, a_2 ··· a_n. It is the weight assigned to R_i(·). This weight is large when the figure of merit is good – i.e. small.
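A finite truncation of (1.2) can be sketched directly. Everything here – the two models, their description lengths, the normalization over a binary alphabet – is illustrative; true ALP sums over all models:

```python
def s_value(model, seq):
    """S_i: the probability the model assigns to the whole known sequence, via (1.1)."""
    s = 1.0
    for j, symbol in enumerate(seq):
        s *= model(seq[:j], symbol)
    return s

def alp_predict(models, seq, symbol, alphabet="01"):
    """Weighted mixture of (1.2): sum over models of 2^(-|R_i|) * S_i * R_i(next|past),
    normalized over the alphabet so the result is a conditional probability.
    `models` is a list of (description_length_in_bits, conditional_prob_fn)."""
    term = lambda bits, m, a: 2.0**-bits * s_value(m, seq) * m(seq, a)
    totals = {a: sum(term(b, m, a) for b, m in models) for a in alphabet}
    return totals[symbol] / sum(totals.values())

fair   = lambda past, a: 0.5                       # short program, mediocre fit
biased = lambda past, a: 0.7 if a == '0' else 0.3  # longer program, better fit
p_zero = alp_predict([(5, fair), (8, biased)], "000100011000", '0')
print(p_zero)  # between 0.5 and 0.7, pulled toward the better-fitting model
```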
Suppose that |R_i| is the shortest program describing the ith model using a particular "reference computer" or programming language – which we will call M. Clearly the value of |R_i| will depend on the nature of M. We will be using machines (or languages) that are "Universal" – machines that can readily program any conceivable function – almost all computers and programming languages are of this kind. The subscript M in P_M expresses the dependence of ALP on the choice of the reference computer or language.

The universality of M assures us that the value of ALP will not depend very much on just which M we use – but the dependence upon M is nonetheless important. It will be discussed at greater length in Sect. 1.5 on "Subjectivity".
Normally in prediction problems we will have some time limit, T, in which we have to make our prediction. In ALP what we want is a set of models of maximum total weight. A set of this sort will give us an approximation that is as close as possible to ALP and gives best predictions. To obtain such a set, we devise a search technique that tries to find, in the available time, T, a set of models, R_i, such that the total weight, ∑_i 2^{−|R_i|} S_i, is as large as possible.

Does ALP have any advantages over other probability evaluation methods? For one, it is the only method known to be complete. The completeness property of ALP means that if there is any regularity in a body of data, our system is guaranteed to discover it using a relatively small sample of that data. More exactly, say we had
some data that were generated by an unknown probabilistic source, P. Not knowing P, we use instead P_M to obtain the Algorithmic Probabilities of the symbols in the data. How much do the symbol probabilities computed by P_M differ from their true probabilities, P? The expected value with respect to P of the total square error between P and P_M is bounded by −(1/2) ln P_0, where P_0 is the a priori probability that P_M assigns to the generator P.

This is an extremely small error rate. The error in probability approaches zero more rapidly than 1/n. Rapid convergence to correct probabilities is a most important feature of ALP. The convergence holds for any P that is describable by a computer program, and includes many functions that are formally incomputable. Various kinds of functions are described in the next section. The convergence proof is in Solomonoff [8].
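In symbols, the completeness bound reads roughly as follows. This is a paraphrase, not a quotation; the notation assumed here is that P_0 denotes the a priori probability that P_M assigns to the generator P:

```latex
\mathbb{E}_P\!\left[\,\sum_{n=1}^{N}\Big(P(a_{n+1}\mid a_1\cdots a_n)
  \;-\; P_M(a_{n+1}\mid a_1\cdots a_n)\Big)^{2}\right] \;\le\; -\tfrac{1}{2}\,\ln P_0 .
```

Because the right-hand side is a constant independent of N, the expected squared error of the nth prediction must shrink faster than 1/n, which is the rate claimed above.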
1.4 Incomputability
It should be noted that in general, it is impossible to find the truly best models with any certainty – there is an infinity of models to be tested and some take an unacceptably long time to evaluate. At any particular time in the search, we will know the best ones so far, but we can't ever be sure that spending a little more time will not give much better models! While it is clear that we can always make approximations to ALP by using a limited number of models, we can never know how close these approximations are to the "True ALP". ALP is indeed, formally incomputable.

In this section, we will investigate how our models are generated and how the incomputability comes about – why it is a necessary, desirable feature of any high performance prediction technique, and how this incomputability in no way inhibits its use for practical prediction.
How Incomputability Arises and How We Deal with It
Recall that for ALP we added up the predictions of all models, using suitable weights:

P_M(a_{n+1} | a_1, a_2 ··· a_n) = ∑_i 2^{−|R_i|} S_i R_i(a_{n+1} | a_1, a_2 ··· a_n)     (1.2)

There are just four kinds of functions that R_i can be:
1. Finite compositions of a finite set of functions
2. Primitive recursive functions
3. Partial recursive functions
4. Total recursive functions
Compositions are combinations of a small set of functions. The finite power series

3.2 + 5.98 ∗ X − 12.54 ∗ X^2 + 7.44 ∗ X^3

is a composition using the functions plus and times on the real numbers. Finite series of this sort can approximate any continuous function to arbitrary precision.
Primitive Recursive Functions are defined by one or more DO loops. For example, to define Factorial(X) we can write:

Factorial(0) ← 1
DO I = 1, X
    Factorial(I) ← I ∗ Factorial(I − 1)
EndDO
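In a modern language the same bounded-loop (primitive recursive) structure looks like this; a sketch, not from the text:

```python
def factorial(x):
    """Primitive recursive style: the loop's trip count is fixed (to x)
    before the loop starts, so termination is guaranteed for every input."""
    result = 1
    for i in range(1, x + 1):   # DO I = 1, X
        result *= i             # Factorial(I) <- I * Factorial(I - 1)
    return result

print(factorial(5), factorial(0))  # 120 1
```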
Partial Recursive Functions are definable using one or more WHILE loops. For example, to define the factorial in this way:

Factorial ← 1
WHILE X ≠ 0
    Factorial ← Factorial ∗ X
    X ← X − 1
EndWHILE

The loop will terminate if X is a non-negative integer. For all other values of X, the loop will run forever. In the present case it is easy to tell for which values of X the loop will terminate.
A simple WHILE loop in which it is not so easy to tell:

WHILE X > 4
    IF X/2 is an integer THEN X ← X/2
    ELSE X ← 3 ∗ X + 1
EndWHILE

This program has been tested with X starting at all positive integers up to more than sixty million. The loop has always terminated, but no one yet is certain as to whether it terminates for all positive integers!
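The loop above (a form of the Collatz problem) is easy to test empirically even though no termination proof is known. A sketch with a step cap, so the test itself is guaranteed to halt:

```python
def terminates(x, max_steps=100_000):
    """Run the WHILE loop above from x; True if it reaches x <= 4 within
    max_steps iterations (a cap, since termination is only conjectured)."""
    for _ in range(max_steps):
        if x <= 4:
            return True
        x = x // 2 if x % 2 == 0 else 3 * x + 1
    return False

print(all(terminates(n) for n in range(1, 100_000)))  # True for this modest range
```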
For any Total Recursive Function we know all values of arguments for which the function has values. Compositions and primitive recursive functions are all total recursive. Many partial recursive functions are total recursive, but some are not. As a consequence of the unsolvability of Turing's "Halting Problem", it will sometimes be impossible to tell if a certain WHILE loop will terminate or not.
Suppose we use (1.2) to approximate ALP by sequentially testing functions in a list of all possible functions – these will be the partial recursive functions, because this is the only recursively enumerable function class that includes all possible predictive functions. As we test to find functions with good figures of merit (small |R_i| − log2 S_i), we find that certain of them don't converge after, say, a time T of 10 s. We know that if we increase T enough, eventually all converging trials will converge and all divergent trials will still diverge – so eventually we will get close to true ALP – but we cannot recognize when this occurs. Furthermore, for any finite T, we cannot ever know a useful upper bound on how large the error in the ALP approximation is. That is why this particular method of approximating ALP is called "incomputable". Could there be another computable approximation technique that would converge? It is easy to show that any computable technique cannot be "complete" – i.e. having very small errors in probability estimates.
Consider an arbitrary computable probability method, R_0. We will show how to generate a sequence for which R_0's errors in probability would always be 0.5 or more. We start our sequence with a single bit, say zero. We then ask R_0 for the most probable next bit. If it says "one is more probable", we make the continuation zero; if it says "zero is more probable", we make the next bit one. If it says "both are equally likely", we make the next bit zero. We generate the third bit in the sequence in the same way, and we can use this method to generate an arbitrarily long continuation of the initial zero.

For this sequence, R_0 will always have an error in probability of at least one half. Since completeness implies that prediction errors approach zero for all finitely describable sequences, it is clear that R_0 or any other computable probability method cannot be complete. Conversely, any complete probability method, such as ALP, cannot be computable.
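The diagonal construction above is concrete enough to execute against any given predictor; a sketch (the predictor interface is our own framing):

```python
def adversarial_sequence(predictor, length):
    """Build a continuation of '0' on which the predictor's stated probability
    for each realized bit is <= 0.5, i.e. its error is always at least one half.
    `predictor` maps the bit string so far to Prob(next bit == '1')."""
    seq = "0"                                  # start with a single zero
    while len(seq) < length:
        p_one = predictor(seq)
        seq += "0" if p_one >= 0.5 else "1"    # pick the bit it finds less likely
    return seq

confident = lambda s: 0.9                      # always bets heavily on '1'
print(adversarial_sequence(confident, 8))      # 00000000 -- it is wrong every time
```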
If we cannot compute ALP, what good is it? It would seem to be of little value for prediction! To answer this objection, we note that from a practical viewpoint, we never have to calculate ALP exactly – we can always use approximations. While it is impossible to know how close our approximations are to the true ALP, that information is rarely needed for practical induction.
What we actually need for practical prediction:
1. Estimates of how good a particular approximation will be in future problems (called "Out of Sample Error")
2. Methods to search for good models
3. Quick and simple methods to compare models

For 1., we can use Cross Validation or Leave One Out – well-known methods that work with most kinds of problems. In addition, because ALP does not overfit or underfit there is usually a better method to make such estimates.

For 2., in Sect. 1.11 we will describe a variant of Levin's Search Procedure, for an efficient search of a very large function space.

For 3., we will always find it easy to compare models via their associated "Figures of Merit", |R_i| − log2(S_i).
In summary, it is clear that all computable prediction methods have a serious flaw – they cannot ever approach completeness. On the other hand, while approximations to ALP can approach completeness, we can never know how close we are to the final, incomputable result. We can, however, get good estimates of the future error in our approximations, and this is all that we really need in a practical prediction system.

That our approximations approach ALP assures us that if we spend enough time searching we will eventually get as little error in prediction as is possible. No computable probability evaluation method can ever give us this assurance. It is in this sense that the incomputability of ALP is a desirable feature.

1.5 Subjectivity
The subjectivity of probability resides in a priori information – the information available to the statistician before he sees the data to be extrapolated. This is independent of what kind of statistical techniques we use. In ALP this a priori information is embodied in M, our "Reference Computer". Recall our assignment of a |R| value to an induction model – it was the length of the program necessary to describe R. In general, this will depend on the machine we use – its instruction set. Since the machines we use are Universal – they can imitate one another – the length of description of programs will not vary widely between most reference machines we might consider. But nevertheless, using small samples of data (as we often do in AI), these differences between machines can modify results considerably.
For quite some time I felt that the dependence of ALP on the reference machine was a serious flaw in the concept, and I tried to find some "objective" universal device, free from the arbitrariness of choosing a particular universal machine. When I thought I finally found a device of this sort, I realized that I really didn't want it – that I had no use for it at all! Let me explain:
In doing inductive inference, one begins with two kinds of information: first, the data itself, and second, the a priori data – the information one had before seeing the data. It is possible to do prediction without data, but one cannot do prediction without a priori information. In choosing a reference machine we are given the opportunity to insert into the a priori probability distribution any information about the data that we know before we see it.

If the reference machine were somehow "objectively" chosen for all induction problems, we would have no way to make use of our prior information. This lack of an objective prior distribution makes ALP very subjective – as are all Bayesian systems.

This certainly makes the results "subjective". If we value objectivity, we can routinely reduce the choice of a machine and representation to certain universal "default" values – but there is a tradeoff between objectivity and accuracy. To obtain the best extrapolation, we must use whatever information is available, and much of this information may be subjective.
Consider two physicians, A and B. A is a conventional physician: he diagnoses ailments on the basis of what he has learned in school, what he has read about, and his own experience in treating patients. B is not a conventional physician. He is “objective”. His diagnosis is entirely “by the book” – things he has learned in school that are universally accepted. He tries as hard as he can to make his judgements free of any bias that might be brought about by his own experience in treating patients.
As a lawyer, I might prefer defending B’s decisions in court, but as a patient, I would prefer A’s intelligently biased diagnosis and treatment.
To the extent that a statistician uses objective techniques, his recommendations may be easily defended, but for accuracy in prediction, the additional information afforded by subjective information can be a critical advantage.
Consider the evolution of a priori information in a scientist during the course of his life. He starts at birth with minimal a priori information – but enough to be able to learn to walk, to learn to communicate, and his immune system is able to adapt to certain hostilities in the environment. Soon after birth, he begins to solve problems and incorporate the problem-solving routines into his a priori tools for future problem solving. This continues throughout the life of the scientist – as he matures, his a priori information matures with him.
In making predictions, there are several commonly used techniques for inserting a priori information. First, by restricting or expanding the set of induction models to be considered. This is certainly the commonest way. Second, by selecting prediction functions with adjustable parameters and assuming a density distribution over those parameters based on past experience with such parameters. Third, we note that much of the information in our sciences is expressed as definitions – additions to our language. ALP, or approximations of it, avails itself of this information by using these definitions to help assign code lengths, and hence a priori probabilities, to models. Computer languages are usually used to describe models, and it is relatively easy to make arbitrary definitions part of the language.
More generally, modifications of computer languages are known to be able to express any conceivable a priori probability distribution. This gives us the ability to incorporate whatever a priori information we like into our computer language. It is certainly more general than any of the other methods of inserting a priori information.
1.6 Diversity and Understanding
Apart from accuracy of probability estimate, ALP has another important value for AI: its multiplicity of models gives us many different ways to understand our data.
A very conventional scientist understands his science using a single “current paradigm” – the way of understanding that is most in vogue at the present time. A more creative scientist understands his science in very many ways, and can more easily create new theories, new ways of understanding, when the “current paradigm” no longer fits the current data.
In the area of AI in which I’m most interested – Incremental Learning – this diversity of explanations is of major importance. At each point in the life of the System, it is able to solve, with acceptable merit, all of the problems it’s been given thus far. We give it a new problem – usually its present Algorithm is adequate. Occasionally, it will have to be modified a bit. But every once in a while it gets a problem of real difficulty and the present Algorithm has to be seriously revised. At such times, we try using or modifying once sub-optimal algorithms. If that doesn’t work, we can use parts of the sub-optimal algorithms and put them together in new ways to make new trial algorithms. It is in giving us a broader basis to learn from the past that this value of ALP lies.
1.6.1 ALP and “The Wisdom of Crowds”
It is a characteristic of ALP that it averages over all possible models of the data. There is evidence that this kind of averaging may be a good idea in a more general setting. “The Wisdom of Crowds” is a recent book by James Surowiecki that investigates this question. The idea is that if you take a bunch of very different kinds of people and ask them (independently) for a solution to a difficult problem, then a suitable average of their solutions will very often be better than the best in the set. He gives examples of people guessing the number of beans in a large glass bottle, or guessing the weight of a large ox, or several more complex, very difficult problems. He is concerned with the question of what kinds of problems can be solved this way, as well as the question of when crowds are wise and when they are stupid. They become very stupid in mobs or in committees in which a single person is able to strongly influence the opinions of the crowd. In a wise crowd, the opinions are
individualized, the needed information is shared by the problem solvers, and the individuals have great diversity in their problem solving techniques. The methods of combining the solutions must enable each of the opinions to be voiced. These conditions are very much the sort of thing we do in ALP. Also, when we approximate ALP, we try to preserve this diversity in the subset of models we use.
1.7 Derivatives of ALP
After my first description of ALP in 1960 [5], there were several related induction models described: minimum message length (MML), Wallace and Boulton [13]; minimum description length (MDL), Rissanen [3]; and stochastic complexity, Rissanen [4]. These models were conceived independently of ALP (though Rissanen had read Kolmogorov’s 1965 paper on minimum coding [1], which is closely related to ALP). MML and MDL select induction models by minimizing the figure of merit |R_i| − log2(S_i), just as ALP does. However, instead of using a weighted sum of models, they use only the single best model.
MDL chooses a space of computable models, then selects the best model from that space. This avoids any incomputability, but greatly limits the kinds of models that it can use. MML recognizes the incomputability of finding the best model, so it is in principle much stronger than MDL. Stochastic complexity, like MDL, first selects a space of computable models – then, like ALP, it uses a weighted sum of all models in that space. Like MDL, it differs from ALP in the limited types of models that are accessible to it. MML is about the same as ALP when the best model is much better than any other found. When several models are of comparable figure of merit, MML and ALP will differ. One advantage of ALP over MML and MDL is in its diversity of models. This is useful if the induction is part of an ongoing process of learning – but if the induction is used on one problem only, diversity is of much less value. Stochastic complexity, of course, does obtain diversity in its limited set of models.
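The contrast between the single-best-model rule and ALP’s weighted sum can be made concrete with a toy numerical sketch. The two Bernoulli “models”, their description lengths, and the sample counts below are all invented for illustration:

```python
import math

# Two invented Bernoulli models R_i: (description length |R_i| in bits,
# probability p that the next symbol is 1).
models = [(1, 0.5), (3, 0.7)]

# Invented data sample: n1 ones, n0 zeros.
n1, n0 = 14, 6

def log2_S(p):
    # log2 of S_i = p^n1 (1 - p)^n0, the probability R_i assigns to the data
    return n1 * math.log2(p) + n0 * math.log2(1 - p)

# MDL/MML minimize the figure of merit |R_i| - log2(S_i) and keep one model:
merits = [(bits - log2_S(p), p) for bits, p in models]
mdl_pred = min(merits)[1]

# ALP weights every model by 2^(-|R_i|) * S_i and sums their predictions:
weights = [2.0 ** (-bits + log2_S(p)) for bits, p in models]
alp_pred = sum(w * p for w, (_, p) in zip(weights, models)) / sum(weights)

print(mdl_pred, alp_pred)
```

When one model’s figure of merit dominates, the mixture collapses to the MDL choice; here the two models are of comparable merit, so the predictions differ.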
1.8 Extensions of ALP
The probability distribution for ALP that I’ve shown is called “The Universal Distribution for sequential prediction”. There are two other universal distributions I’d like to describe:
1.8.1 A Universal Distribution for an Unordered Set of Strings
Suppose we have a corpus of n unordered discrete objects, each described by a finite string, a_j. Given a new string, a_{n+1}, what is the probability that it is in the previous set? In MML and MDL, we consider various algorithms, R_i, that assign probabilities to strings. (We might regard them as Probabilistic Grammars.) We use for prediction the grammar, R_i, for which
|R i | − log2S i (1.6)
is minimum. Here |R_i| is the number of bits in the description of the grammar R_i, and

S_i = ∏_j R_i(a_j)

is the probability assigned to the entire corpus by R_i. If R_k is the best stochastic grammar that we’ve found, then we use R_k(a_{n+1}) as the probability of a_{n+1}. To obtain the ALP version, we simply sum over all models as before, using weights 2^{−|R_i|} S_i.
This kind of ALP has an associated convergence theorem giving very small errors in probability. This approach can be used in linguistics: the a_j can be examples of sentences that are grammatically correct. We can use |R_i| − log2 S_i as a likelihood that the data was created by grammar R_i. Section 1.10 continues the discussion of Grammatical Induction.
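A minimal sketch of this weighted sum, assuming two hypothetical “stochastic grammars” given directly as distributions over a tiny universe of strings (the grammars, their description lengths in bits, and the corpus are all invented):

```python
# Two invented "stochastic grammars", each given directly as a probability
# distribution over a tiny universe of strings, with assumed description
# lengths |R_i| in bits.
grammars = [
    (2, {"ab": 0.5, "aabb": 0.3, "ba": 0.2}),
    (4, {"ab": 0.4, "aabb": 0.4, "aaabbb": 0.2}),
]

corpus = ["ab", "aabb", "ab"]

def S(R):
    # S_i: the probability the grammar assigns to the entire corpus
    prob = 1.0
    for a in corpus:
        prob *= R.get(a, 0.0)
    return prob

def alp_prob(new_string):
    # ALP: sum over all models with weights 2^(-|R_i|) * S_i, normalized
    num = sum(2.0 ** -bits * S(R) * R.get(new_string, 0.0)
              for bits, R in grammars)
    den = sum(2.0 ** -bits * S(R) for bits, R in grammars)
    return num / den

print(alp_prob("aaabbb"))  # nonzero, though the best-fit grammar excludes it
```

This shows the diversity point directly: the single best grammar here assigns "aaabbb" probability zero, while the ALP mixture keeps it alive.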
1.8.2 A Universal Distribution for an Unordered Set of Ordered Pairs of Strings
This type of induction includes almost all kinds of prediction problems as “special cases”. Suppose you have a set of question–answer pairs, Q_1, A_1; Q_2, A_2; ··· Q_n, A_n. Given a new question, Q_{n+1}, what is the probability distribution over possible answers, A_{n+1}? Equivalently, we have an unknown analog and/or digital transducer, and we are given a set of input/output pairs Q_1, A_1; ··· For a new input Q_i, what is the probability distribution on outputs? Or, say the Q_i are descriptions of mushrooms, and the A_i are whether they are poisonous or not.
As before, we hypothesize operators R_j(A|Q) that are able to assign a probability to any A given any Q. The ALP solution is

P(A_{n+1} | Q_{n+1}) = ∑_j 2^{−|R_j|} ( ∏_{i=1}^n R_j(A_i | Q_i) ) R_j(A_{n+1} | Q_{n+1}).   (1.7)
This ALP system has a corresponding theorem for small errors in probability. As before, we try to find a set of models of maximum weight in the available time. Proofs of convergence theorems for these extensions of ALP are in Solomonoff [10]. There are corresponding MDL, MML versions in which we pick the single model of maximum weight.
1.9 Coding the Bernoulli Sequence
First, consider a binary Bernoulli sequence of length n. Its only visible regularity is that zeroes have occurred n0 times and ones have occurred n1 times. One kind of model for this data is that the probability of 0 is p and the probability of 1 is 1 − p. Call this model R_p. S_p is the probability assigned to the data by R_p:

S_p = p^{n0} (1 − p)^{n1}.   (1.8)

Recall that ALP tells us to sum the predictions of each model, with weight given by the product of the a priori probability of the model (2^{−|R_i|}) and S_i, the probability assigned to the data by the model, i.e.:

∑_i 2^{−|R_i|} S_i.   (1.9)

In summing we consider all models with 0 ≤ p ≤ 1. We assume for each model, R_p, a precision Δ in describing p, so p is specified with accuracy Δ. We have 1/Δ models to sum, each with a priori probability about Δ, so the total weight is 1. The sum then approximates the integral

∫_0^1 p^{n0} (1 − p)^{n1} dp = n0! n1! / (n0 + n1 + 1)!.
We can get about the same result another way: the function p^{n0}(1 − p)^{n1} is (if n0 and n1 are large) narrowly peaked at p0 = n0/(n0 + n1). If we used MDL we would use the model with p = p0. The a priori probability of the model itself will depend on how accurately we have to specify p0. If the “width” of the peaked distribution is Δ, then the a priori probability of model M_{p0} will be just Δ, and its total weight in the sum will be

Δ · p0^{n0} (1 − p0)^{n1} = 2 √( p0(1 − p0)/(n0 + n1 + 1) ) · p0^{n0} (1 − p0)^{n1}.¹

If we use Stirling’s approximation for n! (n! ≈ e^{−n} n^n √(2πn)), it is not difficult to show that this agrees with n0! n1!/(n0 + n1 + 1)! up to the factor √(2π)/2: √(2π) = 2.5066, which is roughly equal to 2.
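As a numerical sanity check of this comparison (the counts n0 = 60, n1 = 40 below are chosen arbitrarily):

```python
import math

# Arbitrary illustrative counts for the Bernoulli sequence.
n0, n1 = 60, 40
n = n0 + n1
p0 = n0 / n

# Exact ALP sum over all R_p: the integral of p^n0 (1-p)^n1 from 0 to 1
exact = math.factorial(n0) * math.factorial(n1) / math.factorial(n + 1)

# MDL-style estimate: peak value times the width 2*sqrt(p0(1-p0)/(n0+n1+1))
peak = p0 ** n0 * (1 - p0) ** n1
mdl = 2.0 * math.sqrt(p0 * (1 - p0) / (n + 1)) * peak

# The ratio should come out near sqrt(2*pi)/2, i.e. close to 1.25
print(exact / mdl)
```

The two evaluations agree to within the constant factor discussed in the text.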
To obtain the probability of a zero following a sequence of n0 zeros and n1 ones, we divide the probability of the sequence having the extra zero by the probability of the sequence without the extra zero, i.e.:

( (n0 + 1)! n1! / (n0 + n1 + 2)! ) / ( n0! n1! / (n0 + n1 + 1)! ) = (n0 + 1) / (n0 + n1 + 2).

The expression n0! n1!/(n0 + n1 + 1)! can be generalized for an alphabet of k symbols.
A sequence of k different kinds of symbols has a probability of

∏_{i=1}^k n_i! / (k − 1 + ∑_{i=1}^k n_i)!   (1.14)

where n_i is the number of times the ith symbol occurs. This formula can be obtained by integration in a (k − 1)-dimensional space of the function p_1^{n_1} p_2^{n_2} ··· p_{k−1}^{n_{k−1}} (1 − p_1 − p_2 − ··· − p_{k−1})^{n_k}. Through an argument similar to that used for the binary sequence, the probability of the next symbol being of the jth type is

(n_j + 1) / (k + ∑_{i=1}^k n_i).   (1.15)
A way to visualize this result: the body of data (the “corpus”) consists of the ∑ n_i symbols. Think of a “pre-corpus” containing one of each of the k symbols. If we think of a “macro corpus” as “corpus plus pre-corpus”, we can obtain the probability of the next symbol being the jth one by dividing the number of occurrences of that symbol in the macro corpus by the total number of symbols of all types in the macro corpus.
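The pre-corpus rule can be checked directly against the factorial formula; a small sketch with an invented three-symbol alphabet and illustrative counts:

```python
import math

# Invented alphabet of k = 3 symbols with illustrative occurrence counts.
counts = {"a": 5, "b": 2, "c": 0}
k = len(counts)
n = sum(counts.values())

def corpus_prob(cnts):
    # Probability of a corpus: product(n_i!) / (n + k - 1)!
    num = math.prod(math.factorial(c) for c in cnts.values())
    return num / math.factorial(sum(cnts.values()) + len(cnts) - 1)

# Probability that the next symbol is "a", two equivalent ways:
# 1. ratio of corpus probabilities with and without the extra "a"
extended = dict(counts, a=counts["a"] + 1)
by_ratio = corpus_prob(extended) / corpus_prob(counts)

# 2. pre-corpus counting: (n_a + 1) / (k + n)
by_precorpus = (counts["a"] + 1) / (k + n)

print(by_ratio, by_precorpus)  # the two agree
```

For these counts both routes give (5 + 1)/(3 + 7) = 0.6.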
¹ This can be obtained by getting the first and second moments of the distribution, using the fact that ∫_0^1 p^x (1 − p)^y dp = x! y! / (x + y + 1)!.
It is also possible to have different numbers of each symbol type in the pre-corpus, enabling us to get a great variety of “a priori probability distributions” for our predictions.

1.10 Context Free Grammar Discovery
This is a method of extrapolating an unordered set of finite strings: given the set of strings, a_1, a_2, ··· a_n, what is the probability that a new string, a_{n+1}, is a member of the set? We assume that the original set was generated by some sort of probabilistic device. We want to find a device of this sort that has a high a priori likelihood (i.e., short description length) and assigns high probability to the data set. A good model, R_i, is one with maximum value of

P(R_i) ∏_{j=1}^n R_i(a_j).   (1.16)

Here P(R_i) is the a priori probability of the model R_i, and R_i(a_j) is the probability assigned by R_i to data string a_j.
To understand probabilistic models, we first define non-probabilistic grammars.
In the case of context free grammars, this consists of a set of terminal symbols and
a set of symbols called nonterminals, one of which is the initial starting symbol, S.
A grammar could then be:

S → Ab
→ BaAd
A → BAaS
→ AB
→ a
B → aBa
→ b

To generate a string we start with the symbol S and perform either of the two possible substitutions. If we choose BaAd, we would then have to choose substitutions for the nonterminals B and A. For B, if we chose aBa, we would again have to make a choice for B. If we chose a terminal symbol, like b, for B, then no more substitutions can be made.
An example of a string generation sequence:
S, BaAd, aBaaAd, abaaAd, abaaABd, abaaaBd, abaaabd.
The string abaaabd is then a legally derived string from this grammar The set of all strings legally derivable from a grammar is called the language of the grammar.
The language of a grammar can contain a finite or infinite number of strings. If we replace the deterministic substitution rules with probabilistic rules, we have a probabilistic grammar. A grammar of this sort assigns a probability to every string it can generate. In the deterministic grammar above, S had two rewrite choices, A had three, and B had two. If we assign a probability to each choice, we have a probabilistic grammar:

S 0.1 Ab 0.9 BaAd
A 0.3 BAaS 0.2 AB 0.5 a
B 0.4 aBa 0.6 b
In the derivation of abaaab of the previous example, the substitutions would have probabilities 0.9 to get BaAd, 0.4 to get aBaaAd, 0.6 to get abaaAd, 0.2 to get abaaABd, 0.5 to get abaaaBd, and 0.6 to get abaaabd The probability of the string abaabd being derived this way is 0.9×0.4×0.6×0.2×0.5×0.6 = 0.01296 Often
there are other ways to derive the same string with a grammar, so we have to add
up the probabilities of all of its possible derivations to get the total probability of astring
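The arithmetic of the single derivation above can be sketched as follows (the step probabilities are the ones just listed):

```python
# Rule probabilities of the derivation S -> BaAd -> ... -> abaaabd,
# one per substitution, as listed in the text.
steps = [0.9, 0.4, 0.6, 0.2, 0.5, 0.6]

# The probability of this particular derivation is the product of the steps.
prob = 1.0
for p in steps:
    prob *= p

print(prob)  # approximately 0.01296
```

A full parser would repeat this for every derivation of the string and sum the products.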
Suppose we are given a set of strings, ab, aabb, aaabbb, that were generated by an unknown grammar. How do we find the grammar? I won’t answer that question directly; instead I will tell how to find a sequence of grammars that fits the data progressively better. The best one we find may not be the true generator, but it will give probabilities to strings close to those given by the generator.
The example here is from A. Stolcke’s PhD thesis [12]. We start with an ad hoc grammar that can generate the data, but it overfits – it is too complex:
S → ab
→ aabb
→ aaabbb
We then try a series of modifications of the grammar (Chunking and Merging) that increase the total probability of the description and thereby decrease the total description length. Merging consists of replacing two nonterminals by a single nonterminal. Chunking is the process of defining new nonterminals. We try it when a string or substring has occurred two or more times in the data. ab has occurred three times, so we define X = ab and rewrite the grammar as

S → X
→ aXb
→ aaXbb
X → ab

The substring aXb now occurs twice, so we chunk again with Y = aXb:

S → X
→ Y
→ aYb
X → ab
Y → aXb

At this point there are no repeated strings or substrings, so we try the operation Merge, which coalesces two nonterminals. In the present case merging S and Y would decrease the complexity of the grammar, so we try:
S → X
→ aSb
→ aXb
X → ab

Next, merging S and X gives
S → aSb
→ ab

which is an adequate grammar. At each step there are usually several possible chunk or merge candidates. We choose the candidates that give minimum description length for the resultant grammar.
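A minimal sketch of the Chunk operation. For simplicity it uses a longest-repeated-substring heuristic rather than Stolcke’s actual minimum-description-length criterion, so its first choice differs from the text’s choice of ab; the function and the fresh nonterminal name X are invented for illustration:

```python
from collections import Counter

# Minimal sketch of Chunk: find a substring repeated across right-hand
# sides, name it with a new nonterminal, and rewrite the grammar.
def chunk(rules):
    # rules: dict mapping nonterminal -> list of right-hand-side strings
    counts = Counter()
    for rhss in rules.values():
        for rhs in rhss:
            for length in range(2, len(rhs) + 1):
                for i in range(len(rhs) - length + 1):
                    counts[rhs[i:i + length]] += 1
    repeated = [s for s, c in counts.items() if c >= 2]
    if not repeated:
        return rules
    target = max(repeated, key=len)  # simple heuristic, not MDL
    new_nt = "X"  # hypothetical fresh nonterminal name
    new_rules = {nt: [rhs.replace(target, new_nt) for rhs in rhss]
                 for nt, rhss in rules.items()}
    new_rules[new_nt] = [target]
    return new_rules

grammar = {"S": ["ab", "aabb", "aaabbb"]}
chunked = chunk(grammar)
print(chunked)
```

A real implementation would generate all chunk and merge candidates and score each resultant grammar by its description length, keeping the best (or, as suggested below, the best 10 or 100).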
How do we calculate the length of description of a grammar and its description
of the data set?
Consider the grammar

S → X
→ Y
→ aYb
X → ab
Y → aXb

In describing this grammar, the names of the nonterminals (other than the first one, S) are not relevant. We can describe the right hand side by the string X s1 Y s1 aYb s1 s2 ab s1 s2 aXb s1 s2. Here s1 and s2 are punctuation symbols: s1 marks the end of a string; s2 marks the end of a sequence of strings that belong to the same nonterminal. The string to be encoded has seven kinds of symbols. The number of times each occurs: X, 2; Y, 2; S, 0; a, 3; b, 3; s1, 5; s2, 3. We can then use formula (1.14) to compute the probability of the grammar: k = 7, since there are seven symbols, and n_1 = 2, n_2 = 2, n_3 = 0, n_4 = 3, etc. We also have to include the probability of 2, the number of kinds of terminals, and of 3, the number of kinds of nonterminals. There is some disagreement in the machine learning community about how best
to assign probability to integers, n. A common form decreases slowly with n; its first moment is infinite, which means it is very biased toward large numbers. If we have reason to believe, from previous experience, that n will not be very large, but will be about λ, then a reasonable form of P(n) might be P(n) = A e^{−n/λ}, A being a normalization constant.
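The grammar-probability computation can be sketched directly for the counts given above. This evaluates only the ∏ n_i!/(n + k − 1)! part; as noted in the text, a full description would also encode the integers 2 and 3:

```python
import math

# Symbol counts for the right-hand-side string of the grammar above:
# X, Y, S, a, b, s1, s2 (so k = 7 kinds of symbols).
counts = [2, 2, 0, 3, 3, 5, 3]
k = len(counts)
n = sum(counts)  # 18 symbols in the encoded string

# product(n_i!) / (n + k - 1)! : the probability of the symbol sequence
prob = math.prod(math.factorial(c) for c in counts) / math.factorial(n + k - 1)
print(prob)
```

The result is a very small number, as expected for the probability of one particular 18-symbol description.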
The foregoing enables us to evaluate P(R_i) of (1.16). The ∏_{j=1}^n R_i(a_j) part is evaluated by considering the choices made when the grammar produces the data corpus. For each nonterminal, we will have a sequence of decisions whose probabilities can be evaluated by an expression like (1.14), or perhaps the simpler technique of (1.15) that uses the “pre-corpus”. Since there are three nonterminals, we need the product of three such expressions.
The process used by Stolcke in his thesis was to make various trials of chunking or merging in attempts to successively get a shorter description length – or to increase (1.16) – essentially a very greedy method. He has been actively working on context free grammar discovery since then, and has probably discovered many improvements. There are many more recent papers at his website.
Most, if not all, CFG discovery has been oriented toward finding a single best grammar. For applications in AI and genetic programming it is useful to have large sets of not necessarily best grammars – giving much needed diversity. One way to implement this: at each stage of modification of a grammar, there are usually several different operations that can reduce description length. We could pursue such paths in parallel, perhaps retaining the best 10 or best 100 grammars thus far. Branches taken early in the search could lead to very divergent paths and much needed diversity. This approach helps avoid local optima in grammars and premature convergence when applied to Genetic Programming.

1.11 Levin’s Search Technique
In the section on incomputability we mentioned the importance of good search techniques for finding effective induction models. The procedure we will describe was inspired by Levin’s search technique [2], but is applied to a different kind of problem.
Here, we have a corpus of data to extrapolate, and we want to search over a function space to find functions (“models”) R_i( ) such that 2^{−|R_i|} S_i is as large as possible. In this search, for some R_i, the time needed to evaluate S_i (the probability assigned to the corpus by R_i) may be unacceptably large – possibly infinite.
Suppose we have a (deterministic) context free grammar, G, that can generate strings that are programs in some computer language. (Most computer languages have associated grammars of this kind.) In generating programs, the grammar will have various choices of substitutions to make. If we give each substitution in a k-way choice a probability of 1/k, then we have a probabilistic grammar that assigns a priori probabilities to the programs it generates. If we use a functional language (such as LISP), this will give a probability distribution over all functions it can generate. The probability assigned to the function R_i will be denoted by P_M(R_i). Here M is the name of the functional computer language. P_M(R_i) corresponds to what we called 2^{−|R_i|} in our earlier discussions; |R_i| corresponds to −log2 P_M(R_i). As before, S_i is the probability assigned to the corpus by R_i. We want to find functions R_i( ) such that P_M(R_i) S_i is as large as possible.
Next we choose a small initial time T – which might be the time needed to execute 10 instructions in our functional language. The initial T is not critical. We then compute P_M(R_i) S_i for all R_i for which t_i/P_M(R_i) < T. Here t_i is the time needed to construct R_i and evaluate its S_i.
There are only a finite number of R_i that satisfy this criterion, and if T is very small, there will be very few, if any. We remember which R_i’s have large P_M(R_i) S_i. Each t_i < T · P_M(R_i), so ∑_i t_i, the total time for this part of the search, is less than T · ∑_i P_M(R_i). Since the P_M(R_i) are a priori probabilities, ∑_i P_M(R_i) must be less than or equal to 1, and so the total time for this part of the search must be less than T.
If we are not satisfied with the R_i we’ve obtained, we double T and do the search again. We continue doubling T and searching until we find satisfactory R_i’s or until we run out of time. If T′ is the value of T when we finish the search, then the total time for the search will be T′ + T′/2 + T′/4 + ··· ≈ 2T′.
If it took time t_j to generate and test one of our “good” models, R_j, then when R_j was discovered, T would be no more than 2 t_j/P_M(R_j) – so we would take no more time than twice this, or 4 t_j/P_M(R_j), to find R_j. Note that this time limit depends on R_j only, and is independent of the fact that we may have aborted the S_i evaluations of many R_i for which t_i was infinite or unacceptably large. This feature of Levin Search is a mandatory requirement for search over a space of partial recursive functions. Any weaker search technique would seriously limit the power of the inductive models available to us.
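The doubling loop above can be sketched in a toy form. Here the “programs” are just bit strings, P_M(R) is taken as 2^−length, and the evaluator is an invented stand-in for constructing R_i and computing S_i; real Levin Search would run candidate programs, aborting those that exceed their share t/P_M(R) of the budget T:

```python
import itertools

# Toy sketch of the Levin-style search loop described above.
def levin_search(evaluate, max_T=2 ** 20):
    # evaluate(program, budget) -> (t_cost, score), or (None, None) if aborted
    T = 1.0
    best = None
    while T <= max_T:
        for length in itertools.count(1):
            prior = 2.0 ** -length  # P_M(R) for programs of this length
            if 1.0 / prior > T:     # even a 1-step program exceeds the budget
                break
            for bits in itertools.product("01", repeat=length):
                program = "".join(bits)
                t, score = evaluate(program, budget=T * prior)
                if t is not None and t / prior < T:  # within t/P_M(R) < T
                    if best is None or score > best[0]:
                        best = (score, program)
        if best is not None:
            return best
        T *= 2  # double the time bound and search again
    return best

# Invented stand-in evaluator: "time" is the program's length, and the
# "score" (standing in for P_M(R) * S) rewards 1-bits.
def evaluate(program, budget):
    t = len(program)
    if t > budget:
        return None, None  # aborted: ran past its share of the budget
    return t, program.count("1")

result = levin_search(evaluate)
print(result)
```

The total work per round is bounded by T because the per-program budgets T · P_M(R) sum to at most T, which is what makes the scheme safe over partial recursive functions.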
When ALP is being used in AI, we are solving a sequence of problems of increasing difficulty. The machine (or language) M is periodically “updated” by inserting subroutines and definitions, etc., into M so that the solutions R_j to problems in the past result in larger P_M(R_j). As a result the t_j/P_M(R_j) are smaller – giving quicker solutions to problems of the past – and usually for problems of the future as well.
1.12 The Future of ALP: Some Open Problems
We have described ALP and some of its properties:
First, its completeness: its remarkable ability to find any irregularities in an apparently small amount of data.
Second: That any complete induction system like ALP must be formally incomputable.
Third: That this incomputability imposes no limit on its use for practical induction. This fact is based on our ability to estimate the future accuracy of any particular induction model. While this seems to be easy to do in ALP without using Cross Validation, more work needs to be done in this area.
ALP was designed to work on difficult problems in AI. The particular kind of AI considered was a version of “Incremental Learning”: we give the machine a simple problem. Using Levin Search, it finds one or more solutions to the problem. The system then updates itself by modifying the reference machine so that the solutions found will have higher a priori probabilities. We then give it new problems somewhat similar to the previous problem. Again we use Levin Search to find solutions. We continue with a sequence of problems of increasing difficulty, updating after each solution is found. As the training sequence continues we expect that we will need less and less care in selecting new problems and that the system will eventually be able to solve a large space of very difficult problems. For a more detailed description of the system, see Solomonoff [11].
The principal things that need to be done to implement such a system:
* We have to find a good reference language. Some good candidates are APL, LISP, FORTH, or a subset of Assembly Language. These languages must be augmented with definitions and subroutines that we expect to be useful in problem solving.
* The design of good training sequences for the system is critical for getting much problem-solving ability into it. I have written some general principles on how to do this [9], but more work needs to be done in this area. For early training, it might be useful to learn definitions of instructions from Maple or Mathematica. For more advanced training we might use the book that Ramanujan used to teach himself mathematics – “A Synopsis of Elementary Results in Pure and Applied Mathematics” by George S. Carr.
* We need a good update algorithm. It is possible to use PPM, a relatively fast, effective method of prediction, for preliminary updating, but to find more complex regularities, a more general algorithm is needed. The universality of the reference language assures us that any conceivable update algorithm can be considered. ALP’s diversity of solutions to problems maximizes the information that we are able to insert into the a priori probability distribution. After a suitable training sequence the system should know enough to usefully work on the problem of updating itself.

Because of ALP’s completeness (among other desirable properties), we expect that the complete AI system described above should become an extremely powerful general problem solving device – going well beyond the limited functional capabilities of current incomplete AI systems.
References

1 Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems of Information Transmission 1(1), 1–7 (1965)
2 Levin, L.A.: Universal sequential search problems. Problems of Information Transmission 9(3), 265–266 (1973)
3 Rissanen, J.: Modeling by the shortest data description. Automatica 14, 465–471 (1978)
4 Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)
5 Solomonoff, R.J.: A preliminary report on a general theory of inductive inference (Revision of Report V–131, Feb 1960), Contract AF 49(639)–376, Report ZTB–138, Zator, Cambridge (Nov 1960) (http://www.world.std.com/~rjs/pubs.html)
6 Solomonoff, R.J.: A formal theory of inductive inference, Part I. Information and Control 7(1), 1–22 (1964)
7 Solomonoff, R.J.: A formal theory of inductive inference, Part II. Information and Control 7(2), 224–254 (1964)
8 Solomonoff, R.J.: Complexity-based induction systems: comparisons and convergence theorems. IEEE Transactions on Information Theory IT–24(4), 422–432 (1978)
9 Solomonoff, R.J.: A system for incremental learning based on algorithmic probability. In: Proceedings of the Sixth Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, 515–527 (Dec 1989)
10 Solomonoff, R.J.: Three kinds of probabilistic induction: universal distributions and convergence theorems. Appears in Festschrift for Chris Wallace (2003)
11 Solomonoff, R.J.: Progress in incremental machine learning. TR IDSIA-16-03, revision 2.0 (2003)
12 Stolcke, A.: On learning context free grammars. PhD Thesis (1994)
13 Wallace, C.S., Boulton, D.M.: An information measure for classification. Computer Journal 11, 185–195 (1968)
Shan, Y., McKay, R.I., Baxter, R., et al.: Grammar Model-Based Program Evolution (Dec 2003). A recent review of work in this area, and what looks like a very good learning system. Discusses mechanics of fitting grammar to data, and how to use grammars to guide search problems.

Stolcke, A., Omohundro, S.: Inducing Probabilistic Grammars by Bayesian Model Merging. ICSI, Berkeley (1994). This is largely a summary of Stolcke’s On Learning Context Free Grammars [12].

Solomonoff, R.J.: The following papers are all available at the website world.std.com/~rjs/pubs.html:
A Preliminary Report on a General Theory of Inductive Inference (1960).
A Formal Theory of Inductive Inference, Part I. Information and Control (1964).
A Formal Theory of Inductive Inference, Part II (June 1964) – Discusses fitting of context free grammars to data. Most of the discussion is correct, but Sects. 4.2.4 and 4.3.4 are questionable and equations (49) and (50) are incorrect.
A Preliminary Report and A Formal Theory give some intuitive justification for the way ALP does induction.
The Discovery of Algorithmic Probability (1997) – Gives heuristic background for the discovery of ALP. Page 27 gives a time line of important publications related to the development of ALP.

Progress in Incremental Machine Learning; Revision 2.0 (Oct 30, 2003) – A more detailed description of the system I’m currently working on. There have been important developments since, however.
The Universal Distribution and Machine Learning (2003) – Discussion of the irrelevance of incomputability to applications for prediction. Also discussion of subjectivity.

Chapter 2
Model Selection and Testing by the MDL Principle
Jorma Rissanen
Abstract This chapter is an outline of the latest developments in the MDL theory as applied to the selection and testing of statistical models. Finding the number of parameters is done by a criterion defined by an MDL based universal model, while the corresponding optimally quantized real valued parameters are determined by the so-called structure function, following Kolmogorov’s idea in the algorithmic theory of complexity. Such models are optimally distinguishable, and they can be tested also in an optimal manner, which differs drastically from the Neyman–Pearson testing theory.

2.1 Modeling Problem
A data generating physical machinery imposes restrictions or properties on data. In statistics we are interested in statistical properties, describable by distributions as models that can be fitted to a set of data x^n = x_1, …, x_n or (y^n, x^n) = (y_1, x_1), …, (y_n, x_n), in the latter case conditional models for data y^n given other data x^n. This case adds little to the discussion, and to simplify the notations we consider only the first type of data, with one exception on regression.
Trang 34can be fitted to data Typically, we have a set of n parametersθ1,θ2, ,θn, but we
wish to fit sub collections of these – not necessarily the k first Each sub collection
would define a structure To simplify the notations we consider the structure
de-fined by the first k parameters in some sorting of all the parameters We also write
θ=θkwhen the number of the parameters needs to be emphasized, in which case
f (x n;θ, k) is written as f (x n;θk) Finally, the classM can be made nested if we identify two models f (x n;θk ) and f (x n;θk+1) ifθk+1=θk , 0 This can be some-
times useful
There is no unique way to express the statistical properties of a physical machinery as a distribution; hence, no unique “true” model. As an example, a coin has a lot of properties such as its shape and size. By throwing it we generate data that reflect statistical properties of the coin together with the throwing mechanism. Even in this simple case it seems clear that these are not unique, for they depend on many other things that we have no control over. All we can hope for is to fit a model, such as a Bernoulli model, which gives us some information about the coin’s statistical behavior. At any rate, to paraphrase Laplace’s statement that he needed no axiom of God in his celestial mechanics, we want a theory where the “true” model assumption is not needed.
In fitting models to data a yardstick is needed to measure the fitting error. In traditional statistics, where a “true” model is hypothesized, the fitting error can be defined as the mean difference of the observed result from the true one, which however must be estimated from the data, because the “true” model is not known. A justification of the traditional view, however vague, stems from the confusion of statistics with probability theory. If we construct problems in probability theory that resemble statistical problems, the relevance of the results clearly depends on how well our hypothesized “true” model captures the properties in the data. In simple cases like coin tossing the resemblance can be good, and useful results can be obtained. However, in realistic, more complex statistical problems we can be seriously misled by the results found that way, for they are based on a methodology which is close to circular reasoning. Nowhere is the failure of the logic more blatant than in the important problem of hypothesis testing. Its basic assumption is that one of the hypotheses is true even when none exists, which leads to a dangerously distorted theory. We discuss below model testing without such an assumption.
In the absence of the "true" model a yardstick can be taken as the value f(x^n; θ̂^k): a large probability or density value represents a good fit and a small one a bad fit. Equivalently, we can take the number log 1/f(x^n; θ̂^k), which by the Kraft inequality can be taken as an ideal code length, ideal because a real code length must be integer valued. But there is a well-known difficulty, for frequently log 1/f(x^n; θ̂^k) → 0 as k → n. We do get a good fit, but the properties learned include too much "noise" if we fit too complex models. To overcome this without resorting to ad hoc means to penalize the undefined complexity, we must face the problem of defining and measuring "complexity".
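The phenomenon can be illustrated numerically; the following sketch is our own and not from the text. We fit Markov models of increasing order to a purely random bit string: the maximized-likelihood code length log2 1/f(x^n; θ̂^k) keeps shrinking as the order grows, even though the data contain nothing but noise, which is exactly why a complexity penalty is needed. The function name and the experimental setup are assumptions for illustration.

```python
import math
import random

def markov_ml_code_length(bits, order):
    """Ideal code length log2 1/f(x^n; theta_hat) under the maximum-likelihood
    Markov model of the given order (one Bernoulli parameter per context)."""
    counts = {}
    for i in range(order, len(bits)):
        ctx = tuple(bits[i - order:i])
        pair = counts.setdefault(ctx, [0, 0])
        pair[bits[i]] += 1
    total = 0.0
    for n0, n1 in counts.values():
        n = n0 + n1
        for c in (n0, n1):
            if c:
                total += c * math.log2(n / c)  # c * log2 1/p_hat for this context
    return total

random.seed(0)
x = [random.randrange(2) for _ in range(512)]  # pure noise, no structure at all
lengths = [markov_ml_code_length(x, k) for k in (0, 2, 4, 6, 8)]
# the "fit" improves steadily with the order, yet nothing real is being learned
```

With order 8 there are up to 256 contexts over only ~500 symbols, so most counts become nearly deterministic and the maximized-likelihood code length drops far below the order-0 value.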
"Complexity" and its close relative "information" are widely used words with ambiguous meanings. We know of only two formally defined and related notions of complexity which have had significant implications, namely, algorithmic complexity and stochastic complexity. Both are related to Shannon's entropy and
hence fundamentally to Hartley's information: the logarithm of the number of elements in a finite set; that is, the shortest code length, as the number of binary digits in the coded string, with which any element of the set can be encoded. Hence, the amount of "complexity" is measured by the unit bit, which will also be used to measure "information", to be defined later. In reality, the set that includes the object of interest is not always finite, and the optimal code length is taken either literally, whenever possible, or as the shortest in a probability sense, or the shortest in the worst case. In practice, however, the sense of optimality is often as good as literally the shortest code length.
The formal definition of complexity means that it is always relative to a framework within which the description or coding of the object is done. A conceptually supreme way, due to Solomonoff [23], is to define it as the length of the shortest program of a universal computer that delivers it as the output. Hence, the framework is the programming language of the computer. Such a programming language can describe the set of partial recursive functions for integers and tuples of them, and if we identify a finite or constructive description with a computer program, we can consider program-defined sets and their elements as the objects of interest. With a modified version of Solomonoff's definition due to Kolmogorov and Chaitin, Kolmogorov defined a data string's model as a finite set that includes the string, and the best such model gets defined in a manner which amounts to the MDL principle. The best model we take to give the algorithmic information in the string, given by the complexity of the best model. The trouble with all this is the fact that such an "algorithmic complexity" is itself noncomputable and hence cannot be implemented. However, Kolmogorov's plan is so beautiful and so easy to describe without any involved mathematics precisely because it is noncomputable, and we discuss it below.
We wish to imitate Kolmogorov's plan in the statistical context in such a way that the needed parts are computable, and hence that they can at least be approximated with an estimate of the approximation error. The role of the set of programs will be played by a set of parametric models, (2.1), (2.2). Such a set is simply selected, perhaps in a tentative way, but we can compare the behavior of several such suggestions. Nothing can be said about how to find the very best set, which again is a noncomputable problem, and consequently we are trying to do only what can be done, rather than something that cannot be done – ever.
In this chapter we discuss the main steps in the program to implement the ideas outlined above for a theory of modeling. We begin with a section on the MDL principle and the Bayesian philosophy and discuss their fundamental differences, for the two are often confused. Next, we describe three universal models, the first of which, the Normalized Maximum Likelihood (NML) model, makes the definition of stochastic complexity possible and gives a criterion for fitting the number of parameters. This is followed by an outline of Kolmogorov's structure function in the algorithmic theory of complexity and its implementation within probability models. The resulting optimally quantized parameters lead to the idea of optimally distinguishable models, which are also defined in an independent manner, thereby providing confidence in the constructs. These, in turn, lead naturally to model testing, in which the inherent lopsidedness of testing a null hypothesis against a composite hypothesis in the Neyman–Pearson theory is removed.
2.2 The MDL Principle and Bayesian Philosophy
The outstanding feature of Bayesian philosophy is the use of distributions not only for the data but also for the parameters, even when their values are non-repeating. This widens enormously the use of probability models in applications over orthodox statistics. To be specific, consider a class of parametric models M = {f(x^n; μ)}, where μ represents any type of parameters, together with a "prior" distribution Q(μ) for them. By Bayes' formula the prior distribution is then converted, in light of the observed data, into the posterior

Q(μ | x^n) = f(x^n; μ) Q(μ) / ∫ f(x^n; ν) Q(ν) dν.
For the Bayesians the distribution Q for the parameters represents prior knowledge, and its meaning is the probability of the event that the value μ is the "true" value. This causes some difficulty if no value is "true", which is often the case. An elaborate scheme has been invented to avoid this by calling Q(μ) the "degree of belief" in the value μ. The trouble comes from the fact that any rational demand on the behavior of "degrees of belief" makes them satisfy the axioms of probability, which apparently leaves the original difficulty intact.
A much more serious difficulty is the selection of the prior, which obviously plays a fundamental role in the posterior and all further developments. One attempt is to try to fit it to the data, but that clearly not only contradicts the very foundation of Bayesian philosophy; without restrictions on the priors, disastrous outcomes can be prevented only by ad hoc means. A different and much more worthwhile line of endeavor is to construct "noninformative" priors, even though there is the difficulty of defining the "information" to be avoided. Clearly, a uniform distribution appeals to intuition whenever it can be defined, and as we see below there is a lot of useful knowledge to be gained even with such "noninformative" distributions!
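As a small numerical illustration of Bayes' formula with such a uniform prior (the grid, the data, and the function name are our own assumptions, not from the text): with Q uniform over a parameter grid, the posterior is just the normalized likelihood, so it peaks at the maximum-likelihood value.

```python
def posterior(x, grid):
    """Posterior over a Bernoulli parameter grid by Bayes' formula;
    a uniform prior Q(mu) cancels in the normalization."""
    ones = sum(x)
    zeros = len(x) - ones
    lik = [mu ** ones * (1 - mu) ** zeros for mu in grid]
    z = sum(lik)  # normalizer, the sum replacing the integral on the grid
    return [l / z for l in lik]

grid = [i / 100 for i in range(1, 100)]          # hypothetical parameter grid
post = posterior([1] * 7 + [0] * 3, grid)        # 7 ones in 10 tosses
mode = grid[post.index(max(post))]               # posterior mode = ML value
```

The posterior mode coincides with the maximum-likelihood value 0.7, which is the sense in which a uniform prior is "noninformative" here.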
The MDL principle, too, permits the use of distributions for the parameters. However, the probabilities used are defined in terms of code lengths, which not only gives them a physical interpretation but also permits their optimization, thereby removing the anomalies and other difficulties in the Bayesian philosophy. Because of the difference in the aims and the means to achieve them, the development of the MDL principle and the questions it raises differ drastically from Bayesian analysis. It calls for information and coding theory, which are of no particular interest nor even utility for the Bayesians.
The MDL principle was originally aimed at obtaining a criterion for estimating
the number of parameters in a class of ARMA models [13], while a related methodwas described earlier in [29] for an even narrower problem The criterion was arrived
at by optimization of the quantization of the real-valued parameters. Unfortunately the criterion for the ARMA models turned out to be asymptotically equivalent to BIC [21], which has given the widely accepted but wrong impression that the MDL principle is BIC. The acronym "BIC" stands for Bayesian Information Criterion, although the "information" in it has nothing to do with Shannon information or information theory.
The usual form of the MDL principle calls for minimization of the ideal code length

log 1/f(x^n; μ) + L(μ),

where L(μ) is a prefix code length for the parameters, required in order to be able to separate their code from the rest. Because a prefix code length defines a distribution by the Kraft inequality, we can write L(μ) = log 1/Q(μ). We call these distributions "priors" to respect the Bayesian tradition, even though they have nothing to do with anybody's prior knowledge. The fact that the meaning of the distribution is the (ideal) code length log 1/Q(μ) with which the parameter value can be encoded imposes restrictions on these distributions, and unlike in the Bayesian philosophy we can try to optimize them.
The intent of the MDL principle is to obtain the shortest code length of the data in a self-contained manner, i.e., by including in the code length all the parts needed. Technically this amounts to a prefix code length for the data, calculated, ideally, using only the means provided by the model class, although in some cases this may have to be slightly augmented. Otherwise, regardless of the model class given, we could get a shorter coding by Kolmogorov complexity, which we want to exclude. Hence, not only are the minimizing parameters and the prior itself to be included when the code length is calculated, but also the probability model for the priors, and so on. This process stops when the last model for a model for a model is found which either assigns equal code lengths to its arguments or is common knowledge and need not be encoded. Since each model teaches a property of the data, we may stop when no more properties of interest can be learned. Usually, two or three steps in this process suffice. For more on the MDL principle we refer to [7].
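A minimal two-part-code sketch of the principle (our own illustration; the grid, the data, and the choice L(μ) = q are assumptions, and the small cost of encoding q itself is ignored): for a Bernoulli model, quantize the parameter to μ = m/2^q and minimize the total length log2 1/f(x^n; μ) + q over the grid.

```python
import math

def two_part_length(x, q):
    """Shortest total length  log2 1/f(x^n; mu) + q  over the grid mu = m / 2^q,
    where q bits encode the index m (the endpoints 0 and 1 are excluded)."""
    ones = sum(x)
    zeros = len(x) - ones
    best = float("inf")
    for m in range(1, 2 ** q):
        mu = m / 2 ** q
        nll = -(ones * math.log2(mu) + zeros * math.log2(1 - mu))
        best = min(best, nll + q)
    return best

x = [1] * 75 + [0] * 25                       # hypothetical data, 75% ones
totals = {q: two_part_length(x, q) for q in range(1, 12)}
q_star = min(totals, key=totals.get)          # precision giving the shortest code
```

Here the maximum-likelihood value 0.75 already lies on the coarse grid q = 2, so finer precision only adds parameter bits; in general the optimal precision grows like (1/2) log2 n, which is the origin of the (k/2) log n penalty appearing later.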
Before applying the MDL principle to the model classes (2.1) and (2.2) we illustrate the process with an example. Take an integer n as the object of interest, without any family of distributions. As the first "model" we take the set {1, 2, ..., 2^m}, where m is the smallest integer such that n belongs to the set, or log2 n ≤ m. We need to encode the model, or the number m. We repeat the argument and get a model for the first model, or the smallest integer k such that log2 m ≤ k. This process ends when the last model has only one element, 1 = 2^0. Such an actual coding system was described in [6], and the total code length is about L(n) = log*2(n) = log2 n + log2 log2 n + ···, the sum ending with the last positive iterated logarithm value. It was shown in [14] that P*(n) = C^{-1} 2^{-L(n)} for C = 2.865 defines a universal prior for the integers. We can now define log2 1/P*(n) as the complexity of n, and log2 1/P*(n) − log2 n, or, closely enough, the sum log*2(log2 n) = log2 log2 n + ···, as the "information" in the number n that we learn with the models given. In other words, the amount of information is the code length for encoding the models needed to encode the object n.
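The iterated-logarithm code length and the universal prior of [14] can be computed directly; a small sketch (the function names are ours):

```python
import math

C = 2.865  # the normalizing constant of [14]

def log_star(n):
    """L(n) = log2 n + log2 log2 n + ..., the sum ending with the last
    positive iterated logarithm value."""
    total, t = 0.0, float(n)
    while True:
        t = math.log2(t)
        if t <= 0:
            break
        total += t
    return total

def universal_prior(n):
    """P*(n) = (1/C) * 2^(-L(n))."""
    return 2.0 ** (-log_star(n)) / C

# e.g. log_star(16) = 4 + 2 + 1 = 7 bits, of which log2 16 = 4 bits encode
# the object and the remaining 3 bits encode the models - the "information"
```
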
2.3 Complexity and Universal Models
A universal model is a fundamental construct borrowed from coding theory and little known in ordinary statistics. Its roots, too, are in the algorithmic theory of complexity. In fact, it was the very reason for Solomonoff's quest for the shortest program length, because he wanted to have a universal prior for the integers as binary strings; see the section on Kolmogorov complexity below.
Given a class of models (2.1), we define a model f̂(y^n; k) to be universal for the class if

(1/n) log [ f(y^n; θ, k) / f̂(y^n; k) ] → 0, (2.3)

for all parameters θ ∈ Ω_k, and optimal universal if the convergence is the fastest possible for almost all θ; the convergence is in a probabilistic sense, either in the mean, taken with respect to f(y^n; θ, k), in probability, or almost surely. The mean sense is of particular interest, because we then have convergence in the Kullback–Leibler distance between the two density functions and hence consistency. The qualification "almost all θ" may be slightly modified; see [15, 16]. If we extend these definitions to the model class (2.2), we can talk about consistency in the number of parameters and ask again for the fastest convergence. In the literature, studies have been made of the weakest criteria under which consistency takes place [8]. Such criteria cannot measure consistency in terms of the Kullback–Leibler distance and do not seem to permit a quest for optimality.
2.3.1 The NML Universal Model
Let y^n → θ̂(y^n) be the Maximum Likelihood (ML) estimate, which minimizes the ideal code length log 1/f(y^n; θ, k) for a fixed k. Consider the maximized joint density or probability of the data in the minimized negative logarithm form

L(y^n, θ̂) = −log [ f(y^n; θ̂, k) / h(θ̂, θ̂) ] − log w(θ̂), (2.4)

where h(θ̂, θ) denotes the density function of the ML estimate and w(θ̂) a prior for the estimates, by an application of the MDL principle. It was originally obtained as the solution to Shtarkov's minmax problem
Trang 392 Model Selection and Testing by the MDL Principle 31
min_q max_{x^n} log [ f(x^n; θ̂(x^n), k) / q(x^n) ], (2.5)

with the solution, the NML model,

f̂(y^n; k) = f(y^n; θ̂(y^n), k) / C_{n,k}, C_{n,k} = ∫ f(x^n; θ̂(x^n), k) dx^n, (2.6)

as well as of the associated minmax problem

min_q max_g E_g log [ f(Y^n; θ̂(Y^n), k) / q(Y^n) ] = min_q max_g [ D(g ‖ q) − D(g ‖ f̂(·; k)) ] + log C_{n,k}, (2.7)

where g ranges over any set that includes f̂(y^n; k) – although the maximizing distribution is then nonunique.
For the special prior w(θ̂) the joint density function equals the marginal p(y^n) = ∫_{θ∈Ω_k} f(y^n; θ, k) w(θ) dθ, and the maximized posterior is given by the δ(θ, θ̂)-functional, whose integral over any subset of Ω_k of volume Δ is unity. This means that the posterior probability of the ML parameters, quantized to any precision, is unity. We might call it the canonical prior.
We have defined [17]

log 1/f̂(y^n; k) = log 1/f(y^n; θ̂(y^n), k) + log C_{n,k} (2.8)

as the stochastic complexity of the data y^n, given the class of models (2.1). If the model class satisfies the central limit theorem and other smoothness conditions, the stochastic complexity is given by the decomposition [17]

log [ f(y^n; θ̂(y^n), k) / f̂(y^n; k) ] = log C_{n,k} = (k/2) log (n/2π) + log ∫_{Ω_k} |I(θ)|^{1/2} dθ + o(1), (2.9)

where I(θ) denotes the Fisher information matrix.
There is a further asymptotic justification for stochastic complexity by the theorems in [15] and [16], which imply, among other things, that f̂(y^n; k) is optimal universal in the mean sense. This means that there is no density function which converges to the data-generating density function f(y^n; θ, k) faster than f̂(y^n; k). In particular, no other estimator y^n → θ̄(y^n) can give a smaller mean length for the normalized density function f(y^n; θ̄(y^n), k) than the ML estimator.
We mention in conclusion that for discrete data the integral in the normalizing coefficient becomes a sum, for which efficient algorithms have been developed for reasonable sizes of n in a number of special cases [9, 10, 24–27]. For such cases the structure of the models can be learned much better than by minimization of the asymptotic criterion (2.9).
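For the Bernoulli class (k = 1) the normalizing sum can be computed exactly and compared with the asymptotic expansion (2.9), for which the Fisher information I(θ) = 1/(θ(1−θ)) gives ∫_0^1 |I(θ)|^{1/2} dθ = π. The following sketch is our own (log base 2 throughout):

```python
import math

def log2_C_bernoulli(n):
    """Exact log2 C_{n,1}: the integral in (2.6) becomes the sum
    sum_j C(n,j) (j/n)^j ((n-j)/n)^(n-j) over the sufficient statistic j."""
    total = 0.0
    for j in range(n + 1):
        term = float(math.comb(n, j))
        if 0 < j < n:
            p = j / n
            term *= p ** j * (1 - p) ** (n - j)
        total += term
    return math.log2(total)

def log2_C_asymptotic(n):
    """(k/2) log2 (n / 2 pi) + log2 of the Fisher-information integral, k = 1."""
    return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

exact, approx = log2_C_bernoulli(100), log2_C_asymptotic(100)
```

Already at n = 100 the exact and asymptotic values agree to within about a tenth of a bit, which gives a feel for how sharp the expansion (2.9) is in this simple case.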
2.3.1.1 NML Model for Class M
The minimization of the stochastic complexity (2.8) with respect to k is meaningful both asymptotically and non-asymptotically. For instance, for discrete data log 1/f̂(y^n; n) can equal log C_{n,n}, which is larger than the minimized stochastic complexity log 1/f̂(y^n; k̂). This is a bit baffling, because log 1/f̂(y^n; k̂) is not a prefix code length; i.e., f̂(y^n; k̂) is not a model. To get a logical explanation, as well as to be able at least to define an optimal universal model for the class (2.2), let y^n → k̂(y^n) = k̂ denote the estimator that maximizes f̂(y^n; k), and hence minimizes the joint code length
L(y^n, k̂) = −log f̂(y^n; k̂(y^n)),

as well as the solution to the minmax problem

max_g min_q E_g log [ f̂(Y^n; k̂(Y^n)) / q(Y^n) ].

The difficulty is to calculate the probabilities g(k̂; k), but for our purpose here we do not need that, for we see that k̂, which minimizes the non-prefix code length log 1/f̂(y^n; k̂(y^n)), also minimizes the prefix code length log 1/f̂(y^n).
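As a toy instance of selecting k̂ (our own sketch, not from the chapter): compare a parameter-free fair-coin model, with code length n bits and no normalizer, against the one-parameter Bernoulli NML model. The one-parameter structure is selected only when the data are biased enough to pay for the extra log2 C_{n,1} bits.

```python
import math

def log2_C_bernoulli(n):
    """log2 of the NML normalizer for the Bernoulli class (exact sum)."""
    total = 0.0
    for j in range(n + 1):
        term = float(math.comb(n, j))
        if 0 < j < n:
            p = j / n
            term *= p ** j * (1 - p) ** (n - j)
        total += term
    return math.log2(total)

def stochastic_complexity(x, k):
    """log2 1/f^(y^n; k) for k = 0 (fair coin) or k = 1 (Bernoulli NML)."""
    n, ones = len(x), sum(x)
    if k == 0:
        return float(n)                 # -log2 2^-n, single model, C = 1
    nll = 0.0
    if 0 < ones < n:
        p = ones / n
        nll = -(ones * math.log2(p) + (n - ones) * math.log2(1 - p))
    return nll + log2_C_bernoulli(n)

k_biased = min((0, 1), key=lambda k: stochastic_complexity([1] * 18 + [0] * 2, k))
k_balanced = min((0, 1), key=lambda k: stochastic_complexity([1, 0] * 10, k))
```

For the 90%-ones string the maximized likelihood saves far more than log2 C_{20,1} costs, so k̂ = 1; for the balanced string the parameter buys nothing and k̂ = 0.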
Although f̂(y^n; k̂) provides an excellent criterion for the structure and the number of parameters in the usual cases where k is not too large in comparison with n, it is not good enough for denoising, where k is of the order of n. Then we cannot calculate h(θ̂, θ), C_{n,k}, nor the joint density accurately enough. One way in regression problems, which include denoising problems, is to approximate the maximum joint density by two-part coding thus:

log 1/f(y^n | X^n; θ̂^k, k) + L(k), (2.13)