Information Theory and Statistical Learning
Frank Emmert-Streib
and Machine Learning
Center for Cancer Research and Cell Biology
School of Biomedical Sciences
97 Lisburn Road, Belfast BT9 7BL, UK
v@bio-complexity.com
Matthias Dehmer
Vienna University of Technology
Institute of Discrete Mathematics and Geometry
Wiedner Hauptstr. 8–10
1040 Vienna, Austria
and
University of Coimbra
Center for Mathematics, Probability and Statistics
Apartado 3008, 3001–454 Coimbra, Portugal
matthias@dehmer.org
ISBN: 978-0-387-84815-0 e-ISBN: 978-0-387-84816-7
DOI: 10.1007/978-0-387-84816-7
Library of Congress Control Number: 2008932107
© Springer Science+Business Media, LLC 2009
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
springer.com
This book presents theoretical and practical results of information theoretic methods used in the context of statistical learning. Its major goal is to advocate and promote the importance and usefulness of information theoretic concepts for understanding and developing the sophisticated machine learning methods necessary not only to cope with the challenges of modern data analysis but also to gain further insights into their theoretical foundations. Here Statistical Learning is loosely defined as a synonym for, e.g., Applied Statistics, Artificial Intelligence or Machine Learning. Over the last decades, many approaches and algorithms have been suggested in the fields mentioned above, for which information theoretic concepts constitute core ingredients. For this reason we present a selected collection of some of the finest concepts and applications thereof from the perspective of information theory as the underlying guiding principles. We consider such a perspective as very insightful and expect an even greater appreciation for this perspective over the next years.

The book is intended for interdisciplinary use, ranging from Applied Statistics, Artificial Intelligence, Applied Discrete Mathematics, Computer Science, Information Theory, Machine Learning to Physics. In addition, people working in the hybrid fields of Bioinformatics, Biostatistics, Computational Biology, Computational Linguistics, Medical Bioinformatics, Neuroinformatics or Web Mining might profit tremendously from the presented results, because these data-driven areas are in permanent need of new approaches to cope with the increasing flood of high-dimensional, noisy data that possess seemingly never ending challenges for their analysis.

Many colleagues, whether consciously or unconsciously, have provided us with input, help and support before and during the writing of this book. In particular we would like to thank Shun-ichi Amari, Hamid Arabnia, Gökhan Bakır, Alexandru T. Balaban, Teodor Silviu Balaban, Frank J. Balbach, João Barros, Igor Bass, Matthias Beck, Danail Bonchev, Stefan Borgert, Mieczyslaw Borowiecki, Rudi L. Cilibrasi, Mike Coleman, Malcolm Cook, Pham Dinh-Tuan, Michael Drmota, Shinto Eguchi, B. Roy Frieden, Bernhard Gittenberger, Galina Glazko, Martin Grabner, Earl Glynn, Peter Grassberger, Peter Hamilton, Kateřina Hlaváčková-Schindler, Lucas R. Hope, Jinjie Huang, Robert Jenssen, Attila Kertész-Farkas, András Kocsor,
Elena Konstantinova, Kevin B. Korb, Alexander Kraskov, Tyll Krüger, Ming Li, J.F. McCann, Alexander Mehler, Marco Möller, Abbe Mowshowitz, Max Mühlhäuser, Markus Müller, Noboru Murata, Arcady Mushegian, Erik P. Nyberg, Paulo Eduardo Oliveira, Hyeyoung Park, Judea Pearl, Daniel Polani, Sándor Pongor, William Reeves, Jorma Rissanen, Panxiang Rong, Reuven Rubinstein, Rainer Siegmund-Schulze, Heinz Georg Schuster, Helmut Schwegler, Chris Seidel, Fred Sobik, Ray J. Solomonoff, Doru Stefanescu, Thomas Stoll, John Storey, Milan Studeny, Ulrich Tamm, Naftali Tishby, Paul M.B. Vitányi, José Miguel Urbano, Kazuho Watanabe, Dongxiao Zhu, Vadim Zverovich, and apologize to all those who have been missed inadvertently. We would also like to thank our editor Amy Brais from Springer, who has always been available and helpful. Last but not least we would like to thank our families for support and encouragement during all the time of preparing the book for publication.

We hope this book will help to spread the enthusiasm we have for this field and inspire people to tackle their own practical or theoretical research problems.
Belfast and Coimbra Frank Emmert-Streib
Contents

1 Algorithmic Probability: Theory and Applications ..... 1
Ray J. Solomonoff

2 Model Selection and Testing by the MDL Principle ..... 25
Jorma Rissanen

3 Normalized Information Distance ..... 45
Paul M.B. Vitányi, Frank J. Balbach, Rudi L. Cilibrasi, and Ming Li

4 The Application of Data Compression-Based Distances to Biological Sequences ..... 83
Attila Kertész-Farkas, András Kocsor, and Sándor Pongor

5 MIC: Mutual Information Based Hierarchical Clustering ..... 101
Alexander Kraskov and Peter Grassberger

6 A Hybrid Genetic Algorithm for Feature Selection Based on Mutual Information ..... 125
Jinjie Huang and Panxiang Rong

7 Information Approach to Blind Source Separation and Deconvolution ..... 153
Pham Dinh-Tuan

8 Causality in Time Series: Its Detection and Quantification by Means of Information Theory ..... 183
Kateřina Hlaváčková-Schindler

9 Information Theoretic Learning and Kernel Methods ..... 209
Robert Jenssen

10 Information-Theoretic Causal Power ..... 231
Kevin B. Korb, Lucas R. Hope, and Erik P. Nyberg

14 Model Selection and Information Criterion ..... 333
Noboru Murata and Hyeyoung Park

15 Extreme Physical Information as a Principle of Universal Stability ..... 355
B. Roy Frieden

16 Entropy and Cloning Methods for Combinatorial Optimization, Sampling and Counting Using the Gibbs Sampler ..... 385
Reuven Rubinstein

Index ..... 435
Contributors

Frank J. Balbach, University of Waterloo, Waterloo, ON, Canada

Lucas R. Hope, Bayesian Intelligence Pty Ltd., lhope@bayesian-intelligence.com

Kateřina Hlaváčková-Schindler, Commission for Scientific Visualization, Austrian Academy of Sciences, Donau-City Str. 1, 1220 Vienna, Austria and Institute of Information Theory and Automation of the Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 4, 18208 Praha 8, Czech Republic, katerina.schindler@assoc.oeaw.ac.at

Jinjie Huang, Department of Automation, Harbin University of Science and Technology, Xuefu Road 52, Harbin 150080, China, jinjiehyh@yahoo.com.cn

Robert Jenssen, Department of Physics and Technology, University of Tromsø, 9037 Tromsø, Norway, robert.jenssen@phys.uit.no
Attila Kertész-Farkas, Research Group on Artificial Intelligence, Aradi vértanúk tere 1, 6720 Szeged, Hungary, kfa@inf.u-szeged.hu

András Kocsor, Research Group on Artificial Intelligence, Aradi vértanúk tere 1, 6720 Szeged, Hungary, kocsor@inf.u-szeged.hu

Kevin B. Korb, Clayton School of IT, Monash University, Clayton 3600, Australia, kevin.korb@infotech.monash.edu.au

Alexander Kraskov, UCL Institute of Neurology, Queen Square, London WC1N 3BG, UK, akraskov@ion.ucl.ac.uk

Ming Li, University of Waterloo, Waterloo, ON, Canada, mli@uwaterloo.ca

Marco Möller, Adaptive Systems Research Group, School of Computer Science, University of Hertfordshire, Hatfield, UK, XXX@herts.ac.uk
Noboru Murata, Waseda University, Tokyo 169-8555, Japan

Jorma Rissanen, Helsinki Institute for Information Technology, Technical Universities of Tampere and Helsinki, and CLRC, Royal Holloway, University of London, London, UK, jorma.rissanen@hiit.fi

Panxiang Rong, Department of Automation, Harbin University of Science and Technology, Xuefu Road 52, Harbin 150080, China, pxrong@hrbust.edu.cn

Reuven Rubinstein, Faculty of Industrial Engineering and Management, Technion, Israel Institute of Technology, Haifa 32000, Israel
Algorithmic Probability: Theory and Applications

Ray J. Solomonoff
Abstract We first define Algorithmic Probability, an extremely powerful method of inductive inference. We discuss its completeness, incomputability, diversity and subjectivity, and show that its incomputability in no way inhibits its use for practical prediction. Applications to Bernoulli sequence prediction and grammar discovery are described. We conclude with a note on its employment in a very strong AI system for very general problem solving.
1.1 Introduction
Ever since probability was invented, there has been much controversy as to just what it meant, how it should be defined, and above all, what is the best way to predict the future from the known past. Algorithmic Probability is a relatively recent definition of probability that attempts to solve these problems.
We begin with a simple discussion of prediction and its relationship to probability. This soon leads to a definition of Algorithmic Probability (ALP) and its properties. The best-known properties of ALP are its incomputability and its completeness (in that order). Completeness means that if there is any regularity (i.e. property useful for prediction) in a batch of data, ALP will eventually find it, using a surprisingly small amount of data. The incomputability means that in the search for regularities, at no point can we make a useful estimate of how close we are to finding the most important ones. We will show, however, that this incomputability is of a very benign kind, so that in no way does it inhibit the use of ALP for good prediction. One of the important properties of ALP is subjectivity, the amount of personal experiential information that the statistician must put into the system. We will show that this
R.J. Solomonoff
Visiting Professor, Computer Learning Research Centre, Royal Holloway, University of London, London, UK
http://world.std.com/ rjs, e-mail: rjsolo@ieee.org
F. Emmert-Streib, M. Dehmer (eds.), Information Theory and Statistical Learning,
DOI: 10.1007/978-0-387-84816-7_1, © Springer Science+Business Media, LLC 2009
is a desirable feature of ALP, rather than a "Bug". Another property of ALP is its diversity – it affords many explanations of data, giving very good understanding of that data.

There have been a few derivatives of Algorithmic Probability – Minimum Message Length (MML), Minimum Description Length (MDL) and Stochastic Complexity – which merit comparison with ALP.
We will discuss the application of ALP to two kinds of problems: Prediction of the Bernoulli Sequence and Discovery of the Grammars of Context Free Languages. We also show how a variation of Levin's search procedure can be used to search over a function space very efficiently to find good predictive models.

The final section is on the future of ALP – some open problems in its application to AI and what we can expect from these applications.
1.2 Prediction, Probability and Induction
What is Prediction?
"An estimate of what is to occur in the future" – but also necessary is a measure of confidence in the prediction. As a negative example, consider an early AI program called "Prospector". It was given the characteristics of a plot of land and was expected to suggest places to drill for oil. While it did indeed do that, it soon became clear that without having any estimate of confidence, it is impossible to know whether it is economically feasible to spend $100,000 for an exploratory drill rig. Probability is one way to express this confidence.

Say the program estimated probabilities of 0.1 for a 1,000-gallon yield, 0.1 for a 10,000-gallon yield and 0.1 for a 100,000-gallon yield. The expected yield would be 0.1 × 1,000 + 0.1 × 10,000 + 0.1 × 100,000 = 11,100 gallons. At $100 per gallon this would give $1,110,000. Subtracting out the $100,000 for the drill rig gives an expected profit of $1,010,000, so it would be worth drilling at that point. The moral is that predictions by themselves are usually of little value – it is necessary to have confidence levels associated with the predictions.
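The expected-value arithmetic above can be checked mechanically. A minimal sketch; the probabilities, yields, $100/gallon price and $100,000 rig cost are the ones stated in the text:

```python
# Expected yield and profit for the drilling example above.
outcomes = [(0.1, 1_000), (0.1, 10_000), (0.1, 100_000)]  # (probability, gallons)

expected_yield = sum(p * gallons for p, gallons in outcomes)   # 11,100 gallons
expected_revenue = 100 * expected_yield                        # $100 per gallon
expected_profit = expected_revenue - 100_000                   # minus the drill rig

print(round(expected_yield), round(expected_profit))  # 11100 1010000
```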
A strong motivation for revising classical concepts of probability has come from the analysis of human problem solving. When working on a difficult problem, a person is in a maze in which he must make choices of possible courses of action. If the problem is a familiar one, the choices will all be easy. If it is not familiar, there can be much uncertainty in each choice, but choices must somehow be made. One basis for choices might be the probability of each choice leading to a quick solution – this probability being based on experience in this problem and in problems like it. A good reason for using probability is that it enables us to use Levin's Search Technique (Sect. 1.11) to find the solution in near minimal time.
The usual method of calculating probability is by taking the ratio of the number of favorable choices to the total number of choices in the past. If the decision to use integration by parts in an integration problem has been successful in the past 43% of the time, then its present probability of success is about 0.43. This method has very poor accuracy if we only have one or two cases in the past, and is undefined if the case has never occurred before. Unfortunately it is just these situations that occur most often in problem solving.
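One classical remedy for the undefined and near-zero-count cases is Laplace's rule of succession, which adds one fictitious success and one fictitious failure to the counts before taking the ratio. A small sketch; the function names are ours, not the chapter's:

```python
def ratio_estimate(successes, trials):
    """The naive frequency estimate discussed above: undefined when trials == 0."""
    return successes / trials            # raises ZeroDivisionError with no past cases

def laplace_estimate(successes, trials):
    """Laplace's rule of succession: (successes + 1) / (trials + 2).
    Defined even with no data, and never exactly 0 or 1."""
    return (successes + 1) / (trials + 2)

print(laplace_estimate(43, 100))  # integration-by-parts example: about 0.431
print(laplace_estimate(0, 2))     # two failure-free trials -> 0.25, not 0
print(laplace_estimate(0, 0))     # no past cases at all -> 0.5
```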
On a very practical level: if we cross a particular street 10 times and we get hit by a car twice, we might estimate that the probability of getting hit in crossing that street is about 0.2 = 2/10. However, if instead, we only crossed that street twice and we didn't get hit either time, it would be unreasonable to conclude that our probability of getting hit was zero! By seriously revising our definition of probability, we are able to resolve this difficulty and clear up many others that have plagued classical concepts of probability.

What is Induction?
Prediction is usually done by finding inductive models. These are deterministic or probabilistic rules for prediction. We are given a batch of data – typically a series of zeros and ones – and we are asked to predict any one of the data points as a function of the data points that precede it.

In the simplest case, let us suppose that the data has a very simple structure:

0101010101010

In this case, a good inductive rule is "zero is always followed by one; one is always followed by zero". This is an example of deterministic induction, and deterministic prediction. In this case it is 100% correct every time!
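The alternating-sequence rule can be written out directly; a minimal sketch (the function name is ours):

```python
def predict_next(history):
    """Deterministic rule induced from '0101...': zero is always followed
    by one; one is always followed by zero."""
    return '1' if history[-1] == '0' else '0'

data = "0101010101010"
hits = sum(predict_next(data[:i]) == data[i] for i in range(1, len(data)))
print(hits, "of", len(data) - 1)  # 12 of 12: correct every time
```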
There is, however, a common kind of induction problem in which our predictions will not be that reliable. Suppose we are given a sequence of zeros and ones with very little apparent structure. The only apparent regularity is that zero occurs 70% of the time and one appears 30% of the time. Inductive algorithms give a probability for each symbol in a sequence that is a function of any or none of the previous symbols. In the present case, the algorithm is very simple and the probability of the next symbol is independent of the past – the probability of zero seems to be 0.7; the probability of one seems to be 0.3. This kind of simple probabilistic sequence is called a "Bernoulli sequence". The sequence can contain many different kinds of symbols, but the probability of each is independent of the past. In Sect. 1.9 we will discuss the Bernoulli sequence in some detail.

In general we will not always be predicting Bernoulli sequences and there are many possible algorithms (which we will call "models") that tell how to assign a probability to each symbol, based on the past. Which of these should we use? Which will give good predictions in the future?
One desirable feature of an inductive model is that if it is applied to the known sequence, it produces good predictions. Suppose R_i is an inductive algorithm. R_i predicts the probability of a symbol a_j in a sequence a_1, a_2 ··· a_n by looking at the previous symbols. More exactly,

p_j = R_i(a_j | a_1, a_2 ··· a_{j−1})

a_j is the symbol for which we want the probability; a_1, a_2 ··· a_{j−1} are the previous symbols in the sequence. Then R_i is able to give the probability of a particular value of a_j as a function of the past. Here, the values of a_j can range over the entire "alphabet" of symbols that occur in the sequence. If the sequence is binary, a_j will range over the set 0 and 1 only. If the sequence is English text, a_j will range over all alphabetic and punctuation symbols. If R_i is a good predictor, for most of the a_j, the probability it assigns to them will be large – near one.
Consider S, the product of the probabilities that R_i assigns to the individual symbols of the sequence, a_1, a_2 ··· a_n:

S = ∏_{j=1}^{n} p_j     (1.1)

S will give the probability that R_i assigns to the sequence as a whole. For good prediction we want S as large as possible. The maximum value it can have is one, which implies perfect prediction. The smallest value it can have is zero – which can occur if one or more of the p_j are zero – meaning that the algorithm predicted an event to be impossible, yet that event occurred!
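Computing the product S in practice is best done in the log domain, since a product of many probabilities underflows floating-point arithmetic quickly. A sketch, with a hypothetical memoryless model standing in for R_i:

```python
import math

def sequence_probability(model, seq):
    """S: the product of the conditional probabilities the model assigns
    to each symbol of seq, accumulated as a sum of logs to avoid underflow."""
    log2_s = 0.0
    for j, symbol in enumerate(seq):
        p = model(seq[:j], symbol)        # R_i(a_j | a_1 ... a_{j-1})
        if p == 0.0:
            return 0.0                    # the model called an observed event impossible
        log2_s += math.log2(p)
    return 2.0 ** log2_s

# Hypothetical memoryless model: 0.7 for '0', 0.3 for '1', regardless of the past.
bernoulli = lambda past, symbol: 0.7 if symbol == '0' else 0.3
print(sequence_probability(bernoulli, "0001000110"))  # close to 0.7**7 * 0.3**3
```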
The "Maximum Likelihood" method of model selection uses S only to decide upon a model. First, a set of models is chosen by the statistician, based on his experience with the kind of prediction being done. The model within that set having maximum S value is selected.

Maximum Likelihood is very good when there is a lot of data – which is the area in which classical statistics operates. When there is only a small amount of data, it is necessary to consider not only S, but also the effect of the likelihood of the model itself on model selection. The next section will show how this may be done.
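Maximum Likelihood selection over a small hand-chosen model set can be sketched as follows; the candidate set and data below are illustrative, not from the text:

```python
def bernoulli_s(p_zero, seq):
    """S for a memoryless model that assigns p_zero to '0' and 1 - p_zero to '1'."""
    s = 1.0
    for symbol in seq:
        s *= p_zero if symbol == '0' else 1.0 - p_zero
    return s

data = "0001000110"                # 7 zeros, 3 ones
candidates = (0.3, 0.5, 0.7, 0.9)  # the statistician's chosen model set
best = max(candidates, key=lambda p: bernoulli_s(p, data))
print(best)  # 0.7 -- the model whose S on the known data is largest
```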
1.3 Compression and ALP
An important application of symbol prediction is text compression. If an induction algorithm assigns a probability S to a text, there is a coding method – Arithmetic Coding – that can re-create the entire text without error using just −log2 S bits. More exactly: suppose x is a string of English text, in which each character is represented by an 8-bit ASCII code, and there are n characters in x. x would be directly represented by a code of just 8n bits. If we had a prediction model, R, that assigned a probability of S to the text, then it is possible to write a sequence of just −log2 S bits, so that the original text, x, can be recovered from that bit sequence without error.

If R is a string of symbols (usually a computer program) that describes the prediction model, we will use |R| to represent the length of the shortest binary sequence that describes R. If S > 0, then the code for the text will be in two parts: the first part is the code for R, which is |R| bits long, and the second part is the code for the probability of the data, as given by R – it will be just −log2 S bits in length. The sum of these will be |R| − log2 S bits. We can use |R| − log2 S, the length of the compressed code, as a "figure of merit" of a particular induction algorithm with respect to a particular text.

We want an algorithm that will give good prediction, i.e. large S, and small |R|, so |R| − log2 S, the figure of merit, will be as small as possible and the probability it assigns to the text will be as large as possible. Models with |R| larger than optimum are considered to be overfitted. Models in which |R| are smaller than optimum are considered to be underfitted. By choosing a model that minimizes |R| − log2 S, we avoid both underfitting and overfitting, and obtain very good predictions. We will return to this topic later, when we tell how to compute |R| and S for particular models and data sets.
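The figure of merit |R| − log2 S can be compared across models directly; the bit counts and S values below are invented for illustration:

```python
import math

def figure_of_merit(model_bits, s):
    """|R| - log2(S): total compressed length in bits -- the model's own
    description plus the data coded with the model's probabilities."""
    return model_bits - math.log2(s)

# Three hypothetical models of the same text: (|R| in bits, S assigned to the text).
models = {"underfit": (10, 2.0**-60), "balanced": (40, 2.0**-20), "overfit": (90, 2.0**-5)}
for name, (bits, s) in models.items():
    print(name, figure_of_merit(bits, s))   # 70.0, 60.0, 95.0 bits
# "balanced" minimizes |R| - log2 S, so it is the preferred model.
```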
Usually there are many inductive models available. In 1960, I described Algorithmic Probability – ALP [5–7], which uses all possible models in parallel for prediction, with weights dependent upon the figure of merit of each model:

P_M(a_{n+1} | a_1, a_2 ··· a_n) = ∑_i 2^{−|R_i|} S_i R_i(a_{n+1} | a_1, a_2 ··· a_n)     (1.2)

P_M(a_{n+1} | a_1, a_2 ··· a_n) is the probability assigned by ALP to the (n + 1)th symbol of the sequence, in view of the previous part of the sequence.

R_i(a_{n+1} | a_1, a_2 ··· a_n) is the probability assigned by the ith model to the (n + 1)th symbol of the sequence, in view of the previous part of the sequence.

S_i is the probability assigned by R_i (the ith model) to the known sequence, a_1, a_2 ··· a_n, via (1.1).

2^{−|R_i|} S_i is 1/2 raised to a power equal to the figure of merit that R_i has with respect to the data string a_1, a_2 ··· a_n. It is the weight assigned to R_i(·). This weight is large when the figure of merit is good – i.e. small.
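A finite truncation of (1.2) can be sketched directly. Everything here – the two models, their description lengths, the normalization over a binary alphabet – is illustrative; true ALP sums over all models:

```python
def s_value(model, seq):
    """S_i: the probability the model assigns to the whole known sequence, via (1.1)."""
    s = 1.0
    for j, symbol in enumerate(seq):
        s *= model(seq[:j], symbol)
    return s

def alp_predict(models, seq, symbol, alphabet="01"):
    """Weighted mixture of (1.2): sum over models of 2^(-|R_i|) * S_i * R_i(next|past),
    normalized over the alphabet so the result is a conditional probability.
    `models` is a list of (description_length_in_bits, conditional_prob_fn)."""
    term = lambda bits, m, a: 2.0**-bits * s_value(m, seq) * m(seq, a)
    totals = {a: sum(term(b, m, a) for b, m in models) for a in alphabet}
    return totals[symbol] / sum(totals.values())

fair   = lambda past, a: 0.5                       # short program, mediocre fit
biased = lambda past, a: 0.7 if a == '0' else 0.3  # longer program, better fit
p_zero = alp_predict([(5, fair), (8, biased)], "000100011000", '0')
print(p_zero)  # between 0.5 and 0.7, pulled toward the better-fitting model
```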
Suppose that |R_i| is the shortest program describing the ith model using a particular "reference computer" or programming language – which we will call M. Clearly the value of |R_i| will depend on the nature of M. We will be using machines (or languages) that are "Universal" – machines that can readily program any conceivable function – almost all computers and programming languages are of this kind. The subscript M in P_M expresses the dependence of ALP on the choice of the reference computer or language.

The universality of M assures us that the value of ALP will not depend very much on just which M we use – but the dependence upon M is nonetheless important. It will be discussed at greater length in Sect. 1.5 on "Subjectivity".
Normally in prediction problems we will have some time limit, T, in which we have to make our prediction. In ALP what we want is a set of models of maximum total weight. A set of this sort will give us an approximation that is as close as possible to ALP and gives best predictions. To obtain such a set, we devise a search technique that tries to find, in the available time, T, a set of models, R_i, such that the total weight, ∑_i 2^{−|R_i|} S_i, is as large as possible.

Does ALP have any advantages over other probability evaluation methods? For one, it is the only method known to be complete. The completeness property of ALP means that if there is any regularity in a body of data, our system is guaranteed to discover it using a relatively small sample of that data. More exactly, say we had
some data that were generated by an unknown probabilistic source, P. Not knowing P, we use instead P_M to obtain the Algorithmic Probabilities of the symbols in the data. How much do the symbol probabilities computed by P_M differ from their true probabilities, P? The expected value with respect to P of the total square error between P and P_M is bounded by −(1/2) ln P_0, where P_0 is the a priori probability that P_M assigns to the generator P.

This is an extremely small error rate. The error in probability approaches zero more rapidly than 1/n. Rapid convergence to correct probabilities is a most important feature of ALP. The convergence holds for any P that is describable by a computer program, and includes many functions that are formally incomputable. Various kinds of functions are described in the next section. The convergence proof is in Solomonoff [8].
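In symbols, the completeness bound reads roughly as follows. This is a paraphrase, not a quotation; the notation assumed here is that P_0 denotes the a priori probability that P_M assigns to the generator P:

```latex
\mathbb{E}_P\!\left[\,\sum_{n=1}^{N}\Big(P(a_{n+1}\mid a_1\cdots a_n)
  \;-\; P_M(a_{n+1}\mid a_1\cdots a_n)\Big)^{2}\right] \;\le\; -\tfrac{1}{2}\,\ln P_0 .
```

Because the right-hand side is a constant independent of N, the expected squared error of the nth prediction must shrink faster than 1/n, which is the rate claimed above.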
1.4 Incomputability
It should be noted that in general, it is impossible to find the truly best models with any certainty – there is an infinity of models to be tested and some take an unacceptably long time to evaluate. At any particular time in the search, we will know the best ones so far, but we can't ever be sure that spending a little more time will not give much better models! While it is clear that we can always make approximations to ALP by using a limited number of models, we can never know how close these approximations are to the "True ALP". ALP is indeed, formally incomputable.

In this section, we will investigate how our models are generated and how the incomputability comes about – why it is a necessary, desirable feature of any high performance prediction technique, and how this incomputability in no way inhibits its use for practical prediction.
How Incomputability Arises and How We Deal with It
Recall that for ALP we added up the predictions of all models, using suitable weights:

P_M(a_{n+1} | a_1, a_2 ··· a_n) = ∑_i 2^{−|R_i|} S_i R_i(a_{n+1} | a_1, a_2 ··· a_n)     (1.2)

There are just four kinds of functions that R_i can be:
1. Finite compositions of a finite set of functions
2. Primitive recursive functions
3. Partial recursive functions
4. Total recursive functions
Compositions are combinations of a small set of functions. The finite power series

3.2 + 5.98 ∗ X − 12.54 ∗ X^2 + 7.44 ∗ X^3

is a composition using the functions plus and times on the real numbers. Finite series of this sort can approximate any continuous function to arbitrary precision.
Primitive Recursive Functions are defined by one or more DO loops. For example, to define Factorial(X) we can write:

Factorial(0) ← 1
DO I = 1, X
    Factorial(I) ← I ∗ Factorial(I − 1)
EndDO
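In a modern language the same bounded-loop (primitive recursive) structure looks like this; a sketch, not from the text:

```python
def factorial(x):
    """Primitive recursive style: the loop's trip count is fixed (to x)
    before the loop starts, so termination is guaranteed for every input."""
    result = 1
    for i in range(1, x + 1):   # DO I = 1, X
        result *= i             # Factorial(I) <- I * Factorial(I - 1)
    return result

print(factorial(5), factorial(0))  # 120 1
```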
Partial Recursive Functions are definable using one or more WHILE loops. For example, to define the factorial in this way:

Factorial ← 1
WHILE X ≠ 0
    Factorial ← Factorial ∗ X
    X ← X − 1
EndWHILE

The loop will terminate if X is a non-negative integer. For all other values of X, the loop will run forever. In the present case it is easy to tell for which values of X the loop will terminate.
A simple WHILE loop in which it is not so easy to tell:

WHILE X > 4
    IF X/2 is an integer THEN X ← X/2
    ELSE X ← 3 ∗ X + 1
EndWHILE

This program has been tested with X starting at all positive integers up to more than sixty million. The loop has always terminated, but no one yet is certain as to whether it terminates for all positive integers!
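The loop above (a form of the Collatz problem) is easy to test empirically even though no termination proof is known. A sketch with a step cap, so the test itself is guaranteed to halt:

```python
def terminates(x, max_steps=100_000):
    """Run the WHILE loop above from x; True if it reaches x <= 4 within
    max_steps iterations (a cap, since termination is only conjectured)."""
    for _ in range(max_steps):
        if x <= 4:
            return True
        x = x // 2 if x % 2 == 0 else 3 * x + 1
    return False

print(all(terminates(n) for n in range(1, 100_000)))  # True for this modest range
```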
For any Total Recursive Function we know all values of arguments for which the function has values. Compositions and primitive recursive functions are all total recursive. Many partial recursive functions are total recursive, but some are not. As a consequence of the unsolvability of Turing's "Halting Problem", it will sometimes be impossible to tell if a certain WHILE loop will terminate or not.
Suppose we use (1.2) to approximate ALP by sequentially testing functions in a list of all possible functions – these will be the partial recursive functions, because this is the only recursively enumerable function class that includes all possible predictive functions. As we test to find functions with good figures of merit (small |R_i| − log2 S_i), we find that certain of them don't converge after, say, a time T of 10 s. We know that if we increase T enough, eventually all converging trials will converge and all divergent trials will still diverge – so eventually we will get close to true ALP – but we cannot recognize when this occurs. Furthermore, for any finite T, we cannot ever know a useful upper bound on how large the error in the ALP approximation is. That is why this particular method of approximating ALP is called "incomputable". Could there be another computable approximation technique that would converge? It is easy to show that any computable technique cannot be "complete" – i.e. having very small errors in probability estimates.
Consider an arbitrary computable probability method, R_0. We will show how to generate a sequence for which R_0's errors in probability would always be 0.5 or more. We start our sequence with a single bit, say zero. We then ask R_0 for the most probable next bit. If it says "one is more probable", we make the continuation zero; if it says "zero is more probable", we make the next bit one. If it says "both are equally likely", we make the next bit zero. We generate the third bit in the sequence in the same way, and we can use this method to generate an arbitrarily long continuation of the initial zero.

For this sequence, R_0 will always have an error in probability of at least one half. Since completeness implies that prediction errors approach zero for all finitely describable sequences, it is clear that R_0 or any other computable probability method cannot be complete. Conversely, any complete probability method, such as ALP, cannot be computable.
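The diagonal construction above is concrete enough to execute against any given predictor; a sketch (the predictor interface is our own framing):

```python
def adversarial_sequence(predictor, length):
    """Build a continuation of '0' on which the predictor's stated probability
    for each realized bit is <= 0.5, i.e. its error is always at least one half.
    `predictor` maps the bit string so far to Prob(next bit == '1')."""
    seq = "0"                                  # start with a single zero
    while len(seq) < length:
        p_one = predictor(seq)
        seq += "0" if p_one >= 0.5 else "1"    # pick the bit it finds less likely
    return seq

confident = lambda s: 0.9                      # always bets heavily on '1'
print(adversarial_sequence(confident, 8))      # 00000000 -- it is wrong every time
```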
If we cannot compute ALP, what good is it? It would seem to be of little value for prediction! To answer this objection, we note that from a practical viewpoint, we never have to calculate ALP exactly – we can always use approximations. While it is impossible to know how close our approximations are to the true ALP, that information is rarely needed for practical induction.
What we actually need for practical prediction:
1. Estimates of how good a particular approximation will be in future problems (called "Out of Sample Error")
2. Methods to search for good models
3. Quick and simple methods to compare models

For 1., we can use Cross Validation or Leave One Out – well-known methods that work with most kinds of problems. In addition, because ALP does not overfit or underfit there is usually a better method to make such estimates.

For 2., in Sect. 1.11 we will describe a variant of Levin's Search Procedure, for an efficient search of a very large function space.

For 3., we will always find it easy to compare models via their associated "Figures of Merit", |R_i| − log2(S_i).
In summary, it is clear that all computable prediction methods have a serious flaw – they cannot ever approach completeness. On the other hand, while approximations to ALP can approach completeness, we can never know how close we are to the final, incomputable result. We can, however, get good estimates of the future error in our approximations, and this is all that we really need in a practical prediction system.

That our approximations approach ALP assures us that if we spend enough time searching we will eventually get as little error in prediction as is possible. No computable probability evaluation method can ever give us this assurance. It is in this sense that the incomputability of ALP is a desirable feature.

1.5 Subjectivity
The subjectivity of probability resides in a priori information – the information available to the statistician before he sees the data to be extrapolated. This is independent of what kind of statistical techniques we use. In ALP this a priori information is embodied in M, our "Reference Computer". Recall our assignment of a |R| value to an induction model – it was the length of the program necessary to describe R. In general, this will depend on the machine we use – its instruction set. Since the machines we use are Universal – they can imitate one another – the length of description of programs will not vary widely between most reference machines we might consider. But nevertheless, using small samples of data (as we often do in AI), these differences between machines can modify results considerably.
For quite some time I felt that the dependence of ALP on the reference machine was a serious flaw in the concept, and I tried to find some "objective" universal device, free from the arbitrariness of choosing a particular universal machine. When I thought I finally found a device of this sort, I realized that I really didn't want it – that I had no use for it at all! Let me explain:
In doing inductive inference, one begins with two kinds of information: first, the data itself, and second, the a priori data – the information one had before seeing the data. It is possible to do prediction without data, but one cannot do prediction without a priori information. In choosing a reference machine we are given the opportunity to insert into the a priori probability distribution any information about the data that we know before we see it.

If the reference machine were somehow "objectively" chosen for all induction problems, we would have no way to make use of our prior information. This lack of an objective prior distribution makes ALP very subjective – as are all Bayesian systems.

This certainly makes the results "subjective". If we value objectivity, we can routinely reduce the choice of a machine and representation to certain universal "default" values – but there is a tradeoff between objectivity and accuracy. To obtain the best extrapolation, we must use whatever information is available, and much of this information may be subjective.
Consider two physicians, A and B. A is a conventional physician: he diagnoses ailments on the basis of what he has learned in school, what he has read about, and his own experience in treating patients. B is not a conventional physician. He is “objective”. His diagnosis is entirely “by the book” – things he has learned in school that are universally accepted. He tries as hard as he can to make his judgements free of any bias that might be brought about by his own experience in treating patients.
As a lawyer, I might prefer defending B’s decisions in court, but as a patient, I would prefer A’s intelligently biased diagnosis and treatment.
To the extent that a statistician uses objective techniques, his recommendations may be easily defended, but for accuracy in prediction, the additional information afforded by subjective information can be a critical advantage.
Consider the evolution of a priori information in a scientist during the course of his life. He starts at birth with minimal a priori information – but enough to be able to learn to walk, to learn to communicate, and his immune system is able to adapt to certain hostilities in the environment. Soon after birth, he begins to solve problems and incorporate the problem-solving routines into his a priori tools for future problem solving. This continues throughout the life of the scientist – as he matures, his a priori information matures with him.
In making predictions, there are several commonly used techniques for inserting a priori information. First, by restricting or expanding the set of induction models to be considered. This is certainly the commonest way. Second, by selecting prediction functions with adjustable parameters and assuming a density distribution over those parameters based on past experience with such parameters. Third, we note that much of the information in our sciences is expressed as definitions – additions to our language. ALP, or approximations of it, avails itself of this information by using these definitions to help assign code lengths, and hence a priori probabilities, to models. Computer languages are usually used to describe models, and it is relatively easy to make arbitrary definitions part of the language.
More generally, modifications of computer languages are known to be able to express any conceivable a priori probability distribution. This gives us the ability to incorporate whatever a priori information we like into our computer language. It is certainly more general than any of the other methods of inserting a priori information.
1.6 Diversity and Understanding
Apart from accuracy of probability estimate, ALP has another important value for AI: its multiplicity of models gives us many different ways to understand our data.
A very conventional scientist understands his science using a single “current paradigm” – the way of understanding that is most in vogue at the present time. A more creative scientist understands his science in very many ways, and can more easily create new theories, new ways of understanding, when the “current paradigm” no longer fits the current data.
In the area of AI in which I’m most interested – Incremental Learning – this diversity of explanations is of major importance. At each point in the life of the System, it is able to solve, with acceptable merit, all of the problems it’s been given thus far. We give it a new problem – usually its present Algorithm is adequate. Occasionally, it will have to be modified a bit. But every once in a while it gets a problem of real difficulty and the present Algorithm has to be seriously revised. At such times, we try using or modifying once sub-optimal algorithms. If that doesn’t work, we can use parts of the sub-optimal algorithms and put them together in new ways to make new trial algorithms. It is in giving us a broader basis to learn from the past that this value of ALP lies.
1.6.1 ALP and “The Wisdom of Crowds”
It is a characteristic of ALP that it averages over all possible models of the data. There is evidence that this kind of averaging may be a good idea in a more general setting. “The Wisdom of Crowds” is a recent book by James Surowiecki that investigates this question. The idea is that if you take a bunch of very different kinds of people and ask them (independently) for a solution to a difficult problem, then a suitable average of their solutions will very often be better than the best in the set. He gives examples of people guessing the number of beans in a large glass bottle, or guessing the weight of a large ox, or several more complex, very difficult problems. He is concerned with the question of what kinds of problems can be solved this way, as well as the question of when crowds are wise and when they are stupid. They become very stupid in mobs or in committees in which a single person is able to strongly influence the opinions of the crowd. In a wise crowd, the opinions are
individualized, the needed information is shared by the problem solvers, and the individuals have great diversity in their problem solving techniques. The methods of combining the solutions must enable each of the opinions to be voiced. These conditions are very much the sort of thing we do in ALP. Also, when we approximate ALP, we try to preserve this diversity in the subset of models we use.
1.7 Derivatives of ALP
After my first description of ALP in 1960 [5], there were several related induction models described: minimum message length (MML), Wallace and Boulton [13]; minimum description length (MDL), Rissanen [3]; and stochastic complexity, Rissanen [4]. These models were conceived independently of ALP (though Rissanen had read Kolmogorov’s 1965 paper on minimum coding [1], which is closely related to ALP). MML and MDL select induction models by minimizing the figure of merit |R_i| − log2(S_i), just as ALP does. However, instead of using a weighted sum of models, they use only the single best model.
MDL chooses a space of computable models, then selects the best model from that space. This avoids any incomputability, but greatly limits the kinds of models that it can use. MML recognizes the incomputability of finding the best model, so it is in principle much stronger than MDL. Stochastic complexity, like MDL, first selects a space of computable models – then, like ALP, it uses a weighted sum of all models in that space. Like MDL, it differs from ALP in the limited types of models that are accessible to it. MML is about the same as ALP when the best model is much better than any other found. When several models are of comparable figure of merit, MML and ALP will differ. One advantage of ALP over MML and MDL is in its diversity of models. This is useful if the induction is part of an ongoing process of learning – but if the induction is used on one problem only, diversity is of much less value. Stochastic complexity, of course, does obtain diversity in its limited set of models.
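The contrast between the single-best-model rule and ALP’s weighted sum can be made concrete with a toy numerical sketch. The two Bernoulli “models”, their description lengths, and the sample counts below are all invented for illustration:

```python
import math

# Two invented Bernoulli models R_i: (description length |R_i| in bits,
# probability p that the next symbol is 1).
models = [(1, 0.5), (3, 0.7)]

# Invented data sample: n1 ones, n0 zeros.
n1, n0 = 14, 6

def log2_S(p):
    # log2 of S_i = p^n1 (1 - p)^n0, the probability R_i assigns to the data
    return n1 * math.log2(p) + n0 * math.log2(1 - p)

# MDL/MML minimize the figure of merit |R_i| - log2(S_i) and keep one model:
merits = [(bits - log2_S(p), p) for bits, p in models]
mdl_pred = min(merits)[1]

# ALP weights every model by 2^(-|R_i|) * S_i and sums their predictions:
weights = [2.0 ** (-bits + log2_S(p)) for bits, p in models]
alp_pred = sum(w * p for w, (_, p) in zip(weights, models)) / sum(weights)

print(mdl_pred, alp_pred)
```

When one model’s figure of merit dominates, the mixture collapses to the MDL choice; here the two models are of comparable merit, so the predictions differ.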
1.8 Extensions of ALP
The probability distribution for ALP that I’ve shown is called “The Universal Distribution for sequential prediction”. There are two other universal distributions I’d like to describe:
1.8.1 A Universal Distribution for an Unordered Set of Strings
Suppose we have a corpus of n unordered discrete objects, each described by a finite string, a_j. Given a new string, a_{n+1}, what is the probability that it is in the previous set? In MML and MDL, we consider various algorithms, R_i, that assign probabilities to strings. (We might regard them as Probabilistic Grammars.) We use for prediction the grammar, R_i, for which
|R i | − log2S i (1.6)
is minimum. Here |R_i| is the number of bits in the description of the grammar R_i, and

S_i = ∏_j R_i(a_j)

is the probability assigned to the entire corpus by R_i. If R_k is the best stochastic grammar that we’ve found, then we use R_k(a_{n+1}) as the probability of a_{n+1}. To obtain the ALP version, we simply sum over all models as before, using weights 2^{−|R_i|} S_i.
This kind of ALP has an associated convergence theorem giving very small errors in probability. This approach can be used in linguistics: the a_j can be examples of sentences that are grammatically correct. We can use |R_i| − log2 S_i as a likelihood that the data was created by grammar R_i. Section 1.10 continues the discussion of Grammatical Induction.
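A minimal sketch of this weighted sum, assuming two hypothetical “stochastic grammars” given directly as distributions over a tiny universe of strings (the grammars, their description lengths in bits, and the corpus are all invented):

```python
# Two invented "stochastic grammars", each given directly as a probability
# distribution over a tiny universe of strings, with assumed description
# lengths |R_i| in bits.
grammars = [
    (2, {"ab": 0.5, "aabb": 0.3, "ba": 0.2}),
    (4, {"ab": 0.4, "aabb": 0.4, "aaabbb": 0.2}),
]

corpus = ["ab", "aabb", "ab"]

def S(R):
    # S_i: the probability the grammar assigns to the entire corpus
    prob = 1.0
    for a in corpus:
        prob *= R.get(a, 0.0)
    return prob

def alp_prob(new_string):
    # ALP: sum over all models with weights 2^(-|R_i|) * S_i, normalized
    num = sum(2.0 ** -bits * S(R) * R.get(new_string, 0.0)
              for bits, R in grammars)
    den = sum(2.0 ** -bits * S(R) for bits, R in grammars)
    return num / den

print(alp_prob("aaabbb"))  # nonzero, though the best-fit grammar excludes it
```

This shows the diversity point directly: the single best grammar here assigns "aaabbb" probability zero, while the ALP mixture keeps it alive.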
1.8.2 A Universal Distribution for an Unordered Set of Ordered Pairs of Strings
This type of induction includes almost all kinds of prediction problems as “special cases”. Suppose you have a set of question–answer pairs, Q_1, A_1; Q_2, A_2; ··· Q_n, A_n. Given a new question, Q_{n+1}, what is the probability distribution over possible answers, A_{n+1}? Equivalently, we have an unknown analog and/or digital transducer, and we are given a set of input/output pairs Q_1, A_1; ··· For a new input Q_i, what is the probability distribution on outputs? Or, say the Q_i are descriptions of mushrooms, and the A_i are whether they are poisonous or not.
As before, we hypothesize operators R_j(A|Q) that are able to assign a probability to any A given any Q. The ALP solution is

P(A_{n+1} | Q_{n+1}) = ∑_j 2^{−|R_j|} ( ∏_{i=1}^n R_j(A_i | Q_i) ) R_j(A_{n+1} | Q_{n+1}).   (1.7)
This ALP system has a corresponding theorem for small errors in probability. As before, we try to find a set of models of maximum weight in the available time. Proofs of convergence theorems for these extensions of ALP are in Solomonoff [10]. There are corresponding MDL, MML versions in which we pick the single model of maximum weight.
1.9 Coding the Bernoulli Sequence
First, consider a binary Bernoulli sequence of length n. Its only visible regularity is that zeroes have occurred n0 times and ones have occurred n1 times. One kind of model for this data is that the probability of 0 is p and the probability of 1 is 1 − p. Call this model R_p. S_p is the probability assigned to the data by R_p:

S_p = p^{n0} (1 − p)^{n1}.   (1.8)

Recall that ALP tells us to sum the predictions of each model, with weight given by the product of the a priori probability of the model (2^{−|R_i|}) and S_i, the probability assigned to the data by the model, i.e.:

∑_i 2^{−|R_i|} S_i.   (1.9)

In summing we consider all models with 0 ≤ p ≤ 1. We assume for each model, R_p, a precision Δ in describing p, so p is specified with accuracy Δ. We have 1/Δ models to sum, each with a priori probability about Δ, so the total weight is 1. The sum then approximates the integral

∫_0^1 p^{n0} (1 − p)^{n1} dp = n0! n1! / (n0 + n1 + 1)!.
We can get about the same result another way: the function p^{n0}(1 − p)^{n1} is (if n0 and n1 are large) narrowly peaked at p0 = n0/(n0 + n1). If we used MDL we would use the model with p = p0. The a priori probability of the model itself will depend on how accurately we have to specify p0. If the “width” of the peaked distribution is Δ, then the a priori probability of model M_{p0} will be just Δ, and its total weight in the sum will be

Δ · p0^{n0} (1 − p0)^{n1} = 2 √( p0(1 − p0)/(n0 + n1 + 1) ) · p0^{n0} (1 − p0)^{n1}.¹

If we use Stirling’s approximation for n! (n! ≈ e^{−n} n^n √(2πn)), it is not difficult to show that this agrees with n0! n1!/(n0 + n1 + 1)! up to the factor √(2π)/2: √(2π) = 2.5066, which is roughly equal to 2.
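As a numerical sanity check of this comparison (the counts n0 = 60, n1 = 40 below are chosen arbitrarily):

```python
import math

# Arbitrary illustrative counts for the Bernoulli sequence.
n0, n1 = 60, 40
n = n0 + n1
p0 = n0 / n

# Exact ALP sum over all R_p: the integral of p^n0 (1-p)^n1 from 0 to 1
exact = math.factorial(n0) * math.factorial(n1) / math.factorial(n + 1)

# MDL-style estimate: peak value times the width 2*sqrt(p0(1-p0)/(n0+n1+1))
peak = p0 ** n0 * (1 - p0) ** n1
mdl = 2.0 * math.sqrt(p0 * (1 - p0) / (n + 1)) * peak

# The ratio should come out near sqrt(2*pi)/2, i.e. close to 1.25
print(exact / mdl)
```

The two evaluations agree to within the constant factor discussed in the text.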
To obtain the probability of a zero following a sequence of n0 zeros and n1 ones, we divide the probability of the sequence having the extra zero by the probability of the sequence without the extra zero, i.e.:

( (n0 + 1)! n1! / (n0 + n1 + 2)! ) / ( n0! n1! / (n0 + n1 + 1)! ) = (n0 + 1) / (n0 + n1 + 2).

The expression n0! n1!/(n0 + n1 + 1)! can be generalized for an alphabet of k symbols.
A sequence of k different kinds of symbols has a probability of

∏_{i=1}^k n_i! / (k − 1 + ∑_{i=1}^k n_i)!   (1.14)

where n_i is the number of times the ith symbol occurs. This formula can be obtained by integration in a (k − 1)-dimensional space of the function p_1^{n_1} p_2^{n_2} ··· p_{k−1}^{n_{k−1}} (1 − p_1 − p_2 − ··· − p_{k−1})^{n_k}. Through an argument similar to that used for the binary sequence, the probability of the next symbol being of the jth type is

(n_j + 1) / (k + ∑_{i=1}^k n_i).   (1.15)
A way to visualize this result: the body of data (the “corpus”) consists of the ∑ n_i symbols. Think of a “pre-corpus” containing one of each of the k symbols. If we think of a “macro corpus” as “corpus plus pre-corpus”, we can obtain the probability of the next symbol being the jth one by dividing the number of occurrences of that symbol in the macro corpus by the total number of symbols of all types in the macro corpus.
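The pre-corpus rule can be checked directly against the factorial formula; a small sketch with an invented three-symbol alphabet and illustrative counts:

```python
import math

# Invented alphabet of k = 3 symbols with illustrative occurrence counts.
counts = {"a": 5, "b": 2, "c": 0}
k = len(counts)
n = sum(counts.values())

def corpus_prob(cnts):
    # Probability of a corpus: product(n_i!) / (n + k - 1)!
    num = math.prod(math.factorial(c) for c in cnts.values())
    return num / math.factorial(sum(cnts.values()) + len(cnts) - 1)

# Probability that the next symbol is "a", two equivalent ways:
# 1. ratio of corpus probabilities with and without the extra "a"
extended = dict(counts, a=counts["a"] + 1)
by_ratio = corpus_prob(extended) / corpus_prob(counts)

# 2. pre-corpus counting: (n_a + 1) / (k + n)
by_precorpus = (counts["a"] + 1) / (k + n)

print(by_ratio, by_precorpus)  # the two agree
```

For these counts both routes give (5 + 1)/(3 + 7) = 0.6.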
¹ This can be obtained by getting the first and second moments of the distribution, using the fact that ∫_0^1 p^x (1 − p)^y dp = x! y! / (x + y + 1)!.
It is also possible to have different numbers of each symbol type in the pre-corpus, enabling us to get a great variety of “a priori probability distributions” for our predictions.

1.10 Context Free Grammar Discovery
This is a method of extrapolating an unordered set of finite strings: given the set of strings, a_1, a_2, ··· a_n, what is the probability that a new string, a_{n+1}, is a member of the set? We assume that the original set was generated by some sort of probabilistic device. We want to find a device of this sort that has a high a priori likelihood (i.e., short description length) and assigns high probability to the data set. A good model, R_i, is one with maximum value of

P(R_i) ∏_{j=1}^n R_i(a_j).   (1.16)

Here P(R_i) is the a priori probability of the model R_i, and R_i(a_j) is the probability assigned by R_i to data string a_j.
To understand probabilistic models, we first define non-probabilistic grammars.
In the case of context free grammars, this consists of a set of terminal symbols and
a set of symbols called nonterminals, one of which is the initial starting symbol, S.
A grammar could then be:

S → Ab
→ BaAd
A → BAaS
→ AB
→ a
B → aBa
→ b

To generate a string we start with the symbol S and perform either of the two possible substitutions. If we choose BaAd, we would then have to choose substitutions for the nonterminals B and A. For B, if we chose aBa, we would again have to make a choice for B. If we chose a terminal symbol, like b, for B, then no more substitutions can be made.
An example of a string generation sequence:
S, BaAd, aBaaAd, abaaAd, abaaABd, abaaaBd, abaaabd.
The string abaaabd is then a legally derived string from this grammar The set of all strings legally derivable from a grammar is called the language of the grammar.
The language of a grammar can contain a finite or infinite number of strings. If we replace the deterministic substitution rules with probabilistic rules, we have a probabilistic grammar. A grammar of this sort assigns a probability to every string it can generate. In the deterministic grammar above, S had two rewrite choices, A had three, and B had two. If we assign a probability to each choice, we have a probabilistic grammar:

S 0.1 Ab 0.9 BaAd
A 0.3 BAaS 0.2 AB 0.5 a
B 0.4 aBa 0.6 b
In the derivation of abaaab of the previous example, the substitutions would have probabilities 0.9 to get BaAd, 0.4 to get aBaaAd, 0.6 to get abaaAd, 0.2 to get abaaABd, 0.5 to get abaaaBd, and 0.6 to get abaaabd The probability of the string abaabd being derived this way is 0.9×0.4×0.6×0.2×0.5×0.6 = 0.01296 Often
there are other ways to derive the same string with a grammar, so we have to add
up the probabilities of all of its possible derivations to get the total probability of astring
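The arithmetic of the single derivation above can be sketched as follows (the step probabilities are the ones just listed):

```python
# Rule probabilities of the derivation S -> BaAd -> ... -> abaaabd,
# one per substitution, as listed in the text.
steps = [0.9, 0.4, 0.6, 0.2, 0.5, 0.6]

# The probability of this particular derivation is the product of the steps.
prob = 1.0
for p in steps:
    prob *= p

print(prob)  # approximately 0.01296
```

A full parser would repeat this for every derivation of the string and sum the products.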
Suppose we are given a set of strings, ab, aabb, aaabbb, that were generated by an unknown grammar. How do we find the grammar? I won’t answer that question directly; instead I will tell how to find a sequence of grammars that fits the data progressively better. The best one we find may not be the true generator, but it will give probabilities to strings close to those given by the generator.
The example here is from A. Stolcke’s PhD thesis [12]. We start with an ad hoc grammar that can generate the data, but it overfits – it is too complex:
S → ab
→ aabb
→ aaabbb
We then try a series of modifications of the grammar (Chunking and Merging) that increase the total probability of the description and thereby decrease the total description length. Merging consists of replacing two nonterminals by a single nonterminal. Chunking is the process of defining new nonterminals. We try it when a string or substring has occurred two or more times in the data. ab has occurred three times, so we define X = ab and rewrite the grammar as

S → X
→ aXb
→ aaXbb
X → ab

The substring aXb now occurs twice, so we chunk again with Y = aXb:

S → X
→ Y
→ aYb
X → ab
Y → aXb

At this point there are no repeated strings or substrings, so we try the operation Merge, which coalesces two nonterminals. In the present case merging S and Y would decrease the complexity of the grammar, so we try:
S → X
→ aSb
→ aXb
X → ab

Next, merging S and X gives
S → aSb
→ ab

which is an adequate grammar. At each step there are usually several possible chunk or merge candidates. We choose the candidates that give minimum description length for the resultant grammar.
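A minimal sketch of the Chunk operation. For simplicity it uses a longest-repeated-substring heuristic rather than Stolcke’s actual minimum-description-length criterion, so its first choice differs from the text’s choice of ab; the function and the fresh nonterminal name X are invented for illustration:

```python
from collections import Counter

# Minimal sketch of Chunk: find a substring repeated across right-hand
# sides, name it with a new nonterminal, and rewrite the grammar.
def chunk(rules):
    # rules: dict mapping nonterminal -> list of right-hand-side strings
    counts = Counter()
    for rhss in rules.values():
        for rhs in rhss:
            for length in range(2, len(rhs) + 1):
                for i in range(len(rhs) - length + 1):
                    counts[rhs[i:i + length]] += 1
    repeated = [s for s, c in counts.items() if c >= 2]
    if not repeated:
        return rules
    target = max(repeated, key=len)  # simple heuristic, not MDL
    new_nt = "X"  # hypothetical fresh nonterminal name
    new_rules = {nt: [rhs.replace(target, new_nt) for rhs in rhss]
                 for nt, rhss in rules.items()}
    new_rules[new_nt] = [target]
    return new_rules

grammar = {"S": ["ab", "aabb", "aaabbb"]}
chunked = chunk(grammar)
print(chunked)
```

A real implementation would generate all chunk and merge candidates and score each resultant grammar by its description length, keeping the best (or, as suggested below, the best 10 or 100).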
How do we calculate the length of description of a grammar and its description
of the data set?
Consider the grammar

S → X
→ Y
→ aYb
X → ab
Y → aXb

In describing this grammar, the names of the nonterminals (other than the first one, S) are not relevant. We can describe the right hand side by the string X s1 Y s1 aYb s1 s2 ab s1 s2 aXb s1 s2. Here s1 and s2 are punctuation symbols: s1 marks the end of a string; s2 marks the end of a sequence of strings that belong to the same nonterminal. The string to be encoded has seven kinds of symbols. The number of times each occurs: X, 2; Y, 2; S, 0; a, 3; b, 3; s1, 5; s2, 3. We can then use formula (1.14) to compute the probability of the grammar: k = 7, since there are seven symbols, and n_1 = 2, n_2 = 2, n_3 = 0, n_4 = 3, etc. We also have to include the probability of 2, the number of kinds of terminals, and of 3, the number of kinds of nonterminals. There is some disagreement in the machine learning community about how best
to assign probability to integers, n. A common form decreases slowly with n; its first moment is infinite, which means it is very biased toward large numbers. If we have reason to believe, from previous experience, that n will not be very large, but will be about λ, then a reasonable form of P(n) might be P(n) = A e^{−n/λ}, A being a normalization constant.
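The grammar-probability computation can be sketched directly for the counts given above. This evaluates only the ∏ n_i!/(n + k − 1)! part; as noted in the text, a full description would also encode the integers 2 and 3:

```python
import math

# Symbol counts for the right-hand-side string of the grammar above:
# X, Y, S, a, b, s1, s2 (so k = 7 kinds of symbols).
counts = [2, 2, 0, 3, 3, 5, 3]
k = len(counts)
n = sum(counts)  # 18 symbols in the encoded string

# product(n_i!) / (n + k - 1)! : the probability of the symbol sequence
prob = math.prod(math.factorial(c) for c in counts) / math.factorial(n + k - 1)
print(prob)
```

The result is a very small number, as expected for the probability of one particular 18-symbol description.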
The foregoing enables us to evaluate P(R_i) of (1.16). The ∏_{j=1}^n R_i(a_j) part is evaluated by considering the choices made when the grammar produces the data corpus. For each nonterminal, we will have a sequence of decisions whose probabilities can be evaluated by an expression like (1.14), or perhaps the simpler technique of (1.15) that uses the “pre-corpus”. Since there are three nonterminals, we need the product of three such expressions.
The process used by Stolcke in his thesis was to make various trials of chunking or merging in attempts to successively get a shorter description length – or to increase (1.16) – essentially a very greedy method. He has been actively working on context free grammar discovery since then, and has probably discovered many improvements. There are many more recent papers at his website.
Most, if not all, CFG discovery has been oriented toward finding a single best grammar. For applications in AI and genetic programming it is useful to have large sets of not necessarily best grammars – giving much needed diversity. One way to implement this: at each stage of modification of a grammar, there are usually several different operations that can reduce description length. We could pursue such paths in parallel, perhaps retaining the best 10 or best 100 grammars thus far. Branches taken early in the search could lead to very divergent paths and much needed diversity. This approach helps avoid local optima in grammars and premature convergence when applied to Genetic Programming.

1.11 Levin’s Search Technique
In the section on incomputability we mentioned the importance of good search techniques for finding effective induction models. The procedure we will describe was inspired by Levin’s search technique [2], but is applied to a different kind of problem.
Here, we have a corpus of data to extrapolate, and we want to search over a function space to find functions (“models”) R_i( ) such that 2^{−|R_i|} S_i is as large as possible. In this search, for some R_i, the time needed to evaluate S_i (the probability assigned to the corpus by R_i) may be unacceptably large – possibly infinite.
Suppose we have a (deterministic) context free grammar, G, that can generate strings that are programs in some computer language. (Most computer languages have associated grammars of this kind.) In generating programs, the grammar will have various choices of substitutions to make. If we give each substitution in a k-way choice a probability of 1/k, then we have a probabilistic grammar that assigns a priori probabilities to the programs it generates. If we use a functional language (such as LISP), this will give a probability distribution over all functions it can generate. The probability assigned to the function R_i will be denoted by P_M(R_i). Here M is the name of the functional computer language. P_M(R_i) corresponds to what we called 2^{−|R_i|} in our earlier discussions; |R_i| corresponds to −log2 P_M(R_i). As before, S_i is the probability assigned to the corpus by R_i. We want to find functions R_i( ) such that P_M(R_i) S_i is as large as possible.
Next we choose a small initial time T – which might be the time needed to execute 10 instructions in our functional language. The initial T is not critical. We then compute P_M(R_i) S_i for all R_i for which t_i/P_M(R_i) < T. Here t_i is the time needed to construct R_i and evaluate its S_i.
There are only a finite number of R_i that satisfy this criterion, and if T is very small, there will be very few, if any. We remember which R_i’s have large P_M(R_i) S_i. Each t_i < T · P_M(R_i), so ∑_i t_i, the total time for this part of the search, is less than T · ∑_i P_M(R_i). Since the P_M(R_i) are a priori probabilities, ∑_i P_M(R_i) must be less than or equal to 1, and so the total time for this part of the search must be less than T.
If we are not satisfied with the R_i we’ve obtained, we double T and do the search again. We continue doubling T and searching until we find satisfactory R_i’s or until we run out of time. If T′ is the value of T when we finish the search, then the total time for the search will be T′ + T′/2 + T′/4 + ··· ≈ 2T′.
If it took time t_j to generate and test one of our “good” models, R_j, then when R_j was discovered, T would be no more than 2 t_j/P_M(R_j) – so we would take no more time than twice this, or 4 t_j/P_M(R_j), to find R_j. Note that this time limit depends on R_j only, and is independent of the fact that we may have aborted the S_i evaluations of many R_i for which t_i was infinite or unacceptably large. This feature of Levin Search is a mandatory requirement for search over a space of partial recursive functions. Any weaker search technique would seriously limit the power of the inductive models available to us.
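The doubling loop above can be sketched in a toy form. Here the “programs” are just bit strings, P_M(R) is taken as 2^−length, and the evaluator is an invented stand-in for constructing R_i and computing S_i; real Levin Search would run candidate programs, aborting those that exceed their share t/P_M(R) of the budget T:

```python
import itertools

# Toy sketch of the Levin-style search loop described above.
def levin_search(evaluate, max_T=2 ** 20):
    # evaluate(program, budget) -> (t_cost, score), or (None, None) if aborted
    T = 1.0
    best = None
    while T <= max_T:
        for length in itertools.count(1):
            prior = 2.0 ** -length  # P_M(R) for programs of this length
            if 1.0 / prior > T:     # even a 1-step program exceeds the budget
                break
            for bits in itertools.product("01", repeat=length):
                program = "".join(bits)
                t, score = evaluate(program, budget=T * prior)
                if t is not None and t / prior < T:  # within t/P_M(R) < T
                    if best is None or score > best[0]:
                        best = (score, program)
        if best is not None:
            return best
        T *= 2  # double the time bound and search again
    return best

# Invented stand-in evaluator: "time" is the program's length, and the
# "score" (standing in for P_M(R) * S) rewards 1-bits.
def evaluate(program, budget):
    t = len(program)
    if t > budget:
        return None, None  # aborted: ran past its share of the budget
    return t, program.count("1")

result = levin_search(evaluate)
print(result)
```

The total work per round is bounded by T because the per-program budgets T · P_M(R) sum to at most T, which is what makes the scheme safe over partial recursive functions.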
When ALP is being used in AI, we are solving a sequence of problems of increasing difficulty. The machine (or language) M is periodically “updated” by inserting subroutines and definitions, etc., into M so that the solutions R_j to problems in the past result in larger P_M(R_j). As a result the t_j/P_M(R_j) are smaller – giving quicker solutions to problems of the past – and usually for problems of the future as well.
1.12 The Future of ALP: Some Open Problems
We have described ALP and some of its properties:
First, its completeness: its remarkable ability to find any irregularities in an apparently small amount of data.
Second: That any complete induction system like ALP must be formally incomputable.
Third: That this incomputability imposes no limit on its use for practical induction. This fact is based on our ability to estimate the future accuracy of any particular induction model. While this seems to be easy to do in ALP without using Cross Validation, more work needs to be done in this area.
ALP was designed to work on difficult problems in AI. The particular kind of AI considered was a version of “Incremental Learning”: we give the machine a simple problem. Using Levin Search, it finds one or more solutions to the problem. The system then updates itself by modifying the reference machine so that the solutions found will have higher a priori probabilities. We then give it new problems somewhat similar to the previous problem. Again we use Levin Search to find solutions. We continue with a sequence of problems of increasing difficulty, updating after each solution is found. As the training sequence continues we expect that we will need less and less care in selecting new problems and that the system will eventually be able to solve a large space of very difficult problems. For a more detailed description of the system, see Solomonoff [11].
The principal things that need to be done to implement such a system:
* We have to find a good reference language. Some good candidates are APL, LISP, FORTH, or a subset of Assembly Language. These languages must be augmented with definitions and subroutines that we expect to be useful in problem solving.
* The design of good training sequences for the system is critical for getting much problem-solving ability into it. I have written some general principles on how to do this [9], but more work needs to be done in this area. For early training, it might be useful to learn definitions of instructions from Maple or Mathematica. For more advanced training we might use the book that Ramanujan used to teach himself mathematics – “A Synopsis of Elementary Results in Pure and Applied Mathematics” by George S. Carr.
* We need a good update algorithm. It is possible to use PPM, a relatively fast, effective method of prediction, for preliminary updating, but to find more complex regularities, a more general algorithm is needed. The universality of the reference language assures us that any conceivable update algorithm can be considered. ALP’s diversity of solutions to problems maximizes the information that we are able to insert into the a priori probability distribution. After a suitable training sequence the system should know enough to usefully work on the problem of updating itself.

Because of ALP’s completeness (among other desirable properties), we expect that the complete AI system described above should become an extremely powerful general problem solving device – going well beyond the limited functional capabilities of current incomplete AI systems.
References

1 Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems of Information Transmission 1(1), 1–7 (1965)
2 Levin, L.A.: Universal sequential search problems. Problems of Information Transmission 9(3), 265–266 (1973)
3 Rissanen, J.: Modeling by the shortest data description. Automatica 14, 465–471 (1978)
4 Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)
5 Solomonoff, R.J.: A preliminary report on a general theory of inductive inference (Revision of Report V–131, Feb 1960), Contract AF 49(639)–376, Report ZTB–138, Zator, Cambridge (Nov 1960) (http://www.world.std.com/~rjs/pubs.html)
6 Solomonoff, R.J.: A formal theory of inductive inference, Part I. Information and Control 7(1), 1–22 (1964)
7 Solomonoff, R.J.: A formal theory of inductive inference, Part II. Information and Control 7(2), 224–254 (1964)
8 Solomonoff, R.J.: Complexity-based induction systems: comparisons and convergence theorems. IEEE Transactions on Information Theory IT–24(4), 422–432 (1978)
9 Solomonoff, R.J.: A system for incremental learning based on algorithmic probability. In: Proceedings of the Sixth Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, 515–527 (Dec 1989)
10 Solomonoff, R.J.: Three kinds of probabilistic induction: universal distributions and convergence theorems. Appears in Festschrift for Chris Wallace (2003)
11 Solomonoff, R.J.: Progress in incremental machine learning. TR IDSIA-16-03, revision 2.0 (2003)
12 Stolcke, A.: On learning context free grammars. PhD Thesis (1994)
13 Wallace, C.S., Boulton, D.M.: An information measure for classification. Computer Journal 11, 185–195 (1968)
Shan, Y., McKay, R.I., Baxter, R., et al.: Grammar Model-Based Program Evolution (Dec 2003). A recent review of work in this area, and what looks like a very good learning system. Discusses mechanics of fitting grammar to data, and how to use grammars to guide search problems.

Stolcke, A., Omohundro, S.: Inducing Probabilistic Grammars by Bayesian Model Merging. ICSI, Berkeley (1994). This is largely a summary of Stolcke’s On Learning Context Free Grammars [12].

Solomonoff, R.J.: The following papers are all available at the website world.std.com/~rjs/pubs.html:
A Preliminary Report on a General Theory of Inductive Inference (1960).
A Formal Theory of Inductive Inference, Part I. Information and Control (1964).
A Formal Theory of Inductive Inference, Part II (June 1964) – Discusses fitting of context free grammars to data. Most of the discussion is correct, but Sects. 4.2.4 and 4.3.4 are questionable and equations (49) and (50) are incorrect.
A Preliminary Report and A Formal Theory give some intuitive justification for the way ALP does induction.
The Discovery of Algorithmic Probability (1997) – Gives heuristic background for the discovery of ALP. Page 27 gives a time line of important publications related to the development of ALP.

Progress in Incremental Machine Learning; Revision 2.0 (Oct 30, 2003) – A more detailed description of the system I’m currently working on. There have been important developments since, however.
The Universal Distribution and Machine Learning (2003) – Discussion of the irrelevance of incomputability to applications for prediction. Also discussion of subjectivity.

Chapter 2
Model Selection and Testing by the MDL Principle
Jorma Rissanen
Abstract This chapter is an outline of the latest developments in the MDL theory as applied to the selection and testing of statistical models. Finding the number of parameters is done by a criterion defined by an MDL based universal model, while the corresponding optimally quantized real valued parameters are determined by the so-called structure function, following Kolmogorov’s idea in the algorithmic theory of complexity. Such models are optimally distinguishable, and they can be tested also in an optimal manner, which differs drastically from the Neyman–Pearson testing theory.

2.1 Modeling Problem
A data generating physical machinery imposes restrictions or properties on data. In statistics we are interested in statistical properties, describable by distributions as models that can be fitted to a set of data x^n = x_1, …, x_n or (y^n, x^n) = (y_1, x_1), …, (y_n, x_n), in the latter case conditional models for data y^n given other data x^n. This case adds little to the discussion, and to simplify the notations we consider only the first type of data, with one exception on regression.
Trang 34can be fitted to data Typically, we have a set of n parametersθ1,θ2, ,θn, but we
wish to fit sub collections of these – not necessarily the k first Each sub collection
would define a structure To simplify the notations we consider the structure
de-fined by the first k parameters in some sorting of all the parameters We also write
θ=θkwhen the number of the parameters needs to be emphasized, in which case
f (x n;θ, k) is written as f (x n;θk) Finally, the classM can be made nested if we identify two models f (x n;θk ) and f (x n;θk+1) ifθk+1=θk , 0 This can be some-
times useful
There is no unique way to express the statistical properties of a physical machinery as a distribution; hence, no unique “true” model. As an example, a coin has a lot of properties such as its shape and size. By throwing it we generate data that reflect statistical properties of the coin together with the throwing mechanism. Even in this simple case it seems clear that these are not unique, for they depend on many other things that we have no control over. All we can hope for is to fit a model, such as a Bernoulli model, which gives us some information about the coin’s statistical behavior. At any rate, to paraphrase Laplace’s statement that he needed no axiom of God in his celestial mechanics, we want a theory where the “true” model assumption is not needed.
In fitting models to data a yardstick is needed to measure the fitting error. In traditional statistics, where a “true” model is hypothesized, the fitting error can be defined as the mean difference of the observed result from the true one, which however must be estimated from the data, because the “true” model is not known. A justification of the traditional view, however vague, stems from the confusion of statistics with probability theory. If we construct problems in probability theory that resemble statistical problems, the relevance of the results clearly depends on how well our hypothesized “true” model captures the properties in the data. In simple cases like coin tossing the resemblance can be good, and useful results can be obtained. However, in realistic, more complex statistical problems we can be seriously misled by the results found that way, for they are based on a methodology which is close to circular reasoning. Nowhere is the failure of the logic more blatant than in the important problem of hypothesis testing. Its basic assumption is that one of the hypotheses is true even when none exists, which leads to a dangerously distorted theory. We discuss below model testing without such an assumption.
In the absence of the "true" model a yardstick can be taken as the value f(x^n; θ̂^k): a large probability or density value represents a good fit and a small one a bad fit. Equivalently, we can take the number log 1/f(x^n; θ̂^k), which by the Kraft inequality can be taken as an ideal code length, ideal because a real code length must be integer valued. But there is a well-known difficulty, for frequently log 1/f(x^n; θ̂^k) → 0 as k → n. We do get a good fit, but the properties learned include too much "noise" if we fit too complex models. To overcome this without resorting to ad hoc means to penalize the undefined complexity, we must face the problem of defining and measuring "complexity".
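The phenomenon can be illustrated numerically; the following sketch is our own and not from the text. We fit Markov models of increasing order to a purely random bit string: the maximized-likelihood code length log2 1/f(x^n; θ̂^k) keeps shrinking as the order grows, even though the data contain nothing but noise, which is exactly why a complexity penalty is needed. The function name and the experimental setup are assumptions for illustration.

```python
import math
import random

def markov_ml_code_length(bits, order):
    """Ideal code length log2 1/f(x^n; theta_hat) under the maximum-likelihood
    Markov model of the given order (one Bernoulli parameter per context)."""
    counts = {}
    for i in range(order, len(bits)):
        ctx = tuple(bits[i - order:i])
        pair = counts.setdefault(ctx, [0, 0])
        pair[bits[i]] += 1
    total = 0.0
    for n0, n1 in counts.values():
        n = n0 + n1
        for c in (n0, n1):
            if c:
                total += c * math.log2(n / c)  # c * log2 1/p_hat for this context
    return total

random.seed(0)
x = [random.randrange(2) for _ in range(512)]  # pure noise, no structure at all
lengths = [markov_ml_code_length(x, k) for k in (0, 2, 4, 6, 8)]
# the "fit" improves steadily with the order, yet nothing real is being learned
```

With order 8 there are up to 256 contexts over only ~500 symbols, so most counts become nearly deterministic and the maximized-likelihood code length drops far below the order-0 value.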
"Complexity" and its close relative "information" are widely used words with ambiguous meanings. We know of only two formally defined and related notions of complexity which have had significant implications, namely, algorithmic complexity and stochastic complexity. Both are related to Shannon's entropy and
hence fundamentally to Hartley's information: the logarithm of the number of elements in a finite set; that is, the shortest code length, as the number of binary digits in the coded string, with which any element of the set can be encoded. Hence, the amount of "complexity" is measured by the unit bit, which will also be used to measure "information", to be defined later. In reality, the set that includes the object of interest is not always finite, and the optimal code length is taken either literally, whenever possible, or as the shortest in a probability sense, or the shortest in the worst case. In practice, however, the sense of optimality is often as good as literally the shortest code length.
The formal definition of complexity means that it is always relative to a framework within which the description or coding of the object is done. A conceptually supreme way, due to Solomonoff [23], is to define it as the length of the shortest program of a universal computer that delivers it as the output. Hence, the framework is the programming language of the computer. Such a programming language can describe the set of partial recursive functions for integers and tuples of them, and if we identify a finite or constructive description with a computer program, we can consider program-defined sets and their elements as the objects of interest. With a modified version of Solomonoff's definition due to Kolmogorov and Chaitin, Kolmogorov defined a data string's model as a finite set that includes the string, and the best such model gets defined in a manner which amounts to the MDL principle. The best model we take to give the algorithmic information in the string, given by the complexity of the best model. The trouble with all this is the fact that such an "algorithmic complexity" is itself noncomputable and hence cannot be implemented. However, Kolmogorov's plan is so beautiful and so easy to describe without any involved mathematics precisely because it is noncomputable, and we discuss it below.
We wish to imitate Kolmogorov's plan in the statistical context in such a way that the needed parts are computable, and hence that they can at least be approximated with an estimate of the approximation error. The role of the set of programs will be played by a set of parametric models, (2.1), (2.2). Such a set is simply selected, perhaps in a tentative way, but we can compare the behavior of several such suggestions. Nothing can be said about how to find the very best set, which again is a noncomputable problem, and consequently we are trying to do only what can be done, rather than something that cannot be done – ever.
In this chapter we discuss the main steps in the program to implement the ideas outlined above for a theory of modeling. We begin with a section on the MDL principle and the Bayesian philosophy and discuss their fundamental differences, for the two are often confused. Next, we describe three universal models, the first of which, the Normalized Maximum Likelihood (NML) model, makes the definition of stochastic complexity possible and gives a criterion for fitting the number of parameters. This is followed by an outline of Kolmogorov's structure function in the algorithmic theory of complexity and its implementation within probability models. The resulting optimally quantized parameters lead to the idea of optimally distinguishable models, which are also defined in an independent manner, thereby providing confidence in the constructs. These, in turn, lead naturally to model testing, in which the inherent lopsidedness of testing a null hypothesis against a composite hypothesis in the Neyman–Pearson theory is removed.
2.2 The MDL Principle and Bayesian Philosophy
The outstanding feature of Bayesian philosophy is the use of distributions not only for the data but also for the parameters, even when their values are non-repeating. This widens enormously the use of probability models in applications over orthodox statistics. To be specific, consider a class of parametric models M = {f(x^n; μ)}, where μ represents any type of parameters, together with a "prior" distribution Q(μ) for them. By Bayes' formula the prior distribution is then converted, in light of the observed data, into the posterior

Q(μ | x^n) = f(x^n; μ) Q(μ) / ∫ f(x^n; ν) Q(ν) dν.
For the Bayesians the distribution Q for the parameters represents prior knowledge, and its meaning is the probability of the event that the value μ is the "true" value. This causes some difficulty if no value is "true", which is often the case. An elaborate scheme has been invented to avoid this by calling Q(μ) the "degree of belief" in the value μ. The trouble comes from the fact that any rational demand on the behavior of "degrees of belief" makes them satisfy the axioms of probability, which apparently leaves the original difficulty intact.
A much more serious difficulty is the selection of the prior, which obviously plays a fundamental role in the posterior and all further developments. One attempt is to try to fit it to the data, but that clearly not only contradicts the very foundation of Bayesian philosophy; without restrictions on the priors, disastrous outcomes can be prevented only by ad hoc means. A different and much more worthwhile line of endeavor is to construct "noninformative" priors, even though there is the difficulty of defining the "information" to be avoided. Clearly, a uniform distribution appeals to intuition whenever it can be defined, and as we see below there is a lot of useful knowledge to be gained even with such "noninformative" distributions!
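As a small numerical illustration of Bayes' formula with such a uniform prior (the grid, the data, and the function name are our own assumptions, not from the text): with Q uniform over a parameter grid, the posterior is just the normalized likelihood, so it peaks at the maximum-likelihood value.

```python
def posterior(x, grid):
    """Posterior over a Bernoulli parameter grid by Bayes' formula;
    a uniform prior Q(mu) cancels in the normalization."""
    ones = sum(x)
    zeros = len(x) - ones
    lik = [mu ** ones * (1 - mu) ** zeros for mu in grid]
    z = sum(lik)  # normalizer, the sum replacing the integral on the grid
    return [l / z for l in lik]

grid = [i / 100 for i in range(1, 100)]          # hypothetical parameter grid
post = posterior([1] * 7 + [0] * 3, grid)        # 7 ones in 10 tosses
mode = grid[post.index(max(post))]               # posterior mode = ML value
```

The posterior mode coincides with the maximum-likelihood value 0.7, which is the sense in which a uniform prior is "noninformative" here.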
The MDL principle, too, permits the use of distributions for the parameters. However, the probabilities used are defined in terms of code lengths, which not only gives them a physical interpretation but also permits their optimization, thereby removing the anomalies and other difficulties in the Bayesian philosophy. Because of the difference in the aims and the means to achieve them, the development of the MDL principle and the questions it raises differ drastically from Bayesian analysis. It calls for information and coding theory, which are of no particular interest nor even utility for the Bayesians.
The MDL principle was originally aimed at obtaining a criterion for estimating
the number of parameters in a class of ARMA models [13], while a related methodwas described earlier in [29] for an even narrower problem The criterion was arrived
at by optimization of the quantization of the real-valued parameters. Unfortunately the criterion for the ARMA models turned out to be asymptotically equivalent to BIC [21], which has given the widely accepted but wrong impression that the MDL principle is BIC. The acronym "BIC" stands for Bayesian Information Criterion, although the "information" in it has nothing to do with Shannon information or information theory.
The usual form of the MDL principle calls for minimization of the ideal code length

log 1/f(x^n; μ) + L(μ),

where L(μ) is a prefix code length for the parameters, required in order to be able to separate their code from the rest. Because a prefix code length defines a distribution by the Kraft inequality, we can write L(μ) = log 1/Q(μ). We call these distributions "priors" to respect the Bayesian tradition, even though they have nothing to do with anybody's prior knowledge. The fact that the meaning of the distribution is the (ideal) code length log 1/Q(μ) with which the parameter value can be encoded imposes restrictions on these distributions, and unlike in the Bayesian philosophy we can try to optimize them.
The intent of the MDL principle is to obtain the shortest code length of the data in a self-contained manner, i.e., by including in the code length all the parts needed. Technically this amounts to a prefix code length for the data, calculated, ideally, using only the means provided by the model class, although in some cases this may have to be slightly augmented. Otherwise, regardless of the model class given, we could get a shorter coding by Kolmogorov complexity, which we want to exclude. Hence, not only are the minimizing parameters and the prior itself to be included when the code length is calculated, but also the probability model for the priors, and so on. This process stops when the last model for a model for a model is found which either assigns equal code lengths to its arguments or is common knowledge and need not be encoded. Since each model teaches a property of the data, we may stop when no more properties of interest can be learned. Usually, two or three steps in this process suffice. For more on the MDL principle we refer to [7].
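A minimal two-part-code sketch of the principle (our own illustration; the grid, the data, and the choice L(μ) = q are assumptions, and the small cost of encoding q itself is ignored): for a Bernoulli model, quantize the parameter to μ = m/2^q and minimize the total length log2 1/f(x^n; μ) + q over the grid.

```python
import math

def two_part_length(x, q):
    """Shortest total length  log2 1/f(x^n; mu) + q  over the grid mu = m / 2^q,
    where q bits encode the index m (the endpoints 0 and 1 are excluded)."""
    ones = sum(x)
    zeros = len(x) - ones
    best = float("inf")
    for m in range(1, 2 ** q):
        mu = m / 2 ** q
        nll = -(ones * math.log2(mu) + zeros * math.log2(1 - mu))
        best = min(best, nll + q)
    return best

x = [1] * 75 + [0] * 25                       # hypothetical data, 75% ones
totals = {q: two_part_length(x, q) for q in range(1, 12)}
q_star = min(totals, key=totals.get)          # precision giving the shortest code
```

Here the maximum-likelihood value 0.75 already lies on the coarse grid q = 2, so finer precision only adds parameter bits; in general the optimal precision grows like (1/2) log2 n, which is the origin of the (k/2) log n penalty appearing later.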
Before applying the MDL principle to the model classes (2.1) and (2.2) we illustrate the process with an example. Take an integer n as the object of interest, without any family of distributions. As the first "model" we take the set {1, 2, ..., 2^m}, where m is the smallest integer such that n belongs to the set, or log2 n ≤ m. We need to encode the model, or the number m. We repeat the argument and get a model for the first model, or the smallest integer k such that log2 m ≤ k. This process ends when the last model has only one element, 1 = 2^0. Such an actual coding system was described in [6], and the total code length is about L(n) = log*2(n) = log2 n + log2 log2 n + ···, the sum ending with the last positive iterated logarithm value. It was shown in [14] that P*(n) = C^{-1} 2^{-L(n)} for C = 2.865 defines a universal prior for the integers. We can now define log2 1/P*(n) as the complexity of n, and log2 1/P*(n) − log2 n, or, closely enough, the sum log*2(log2 n) = log2 log2 n + ···, as the "information" in the number n that we learn with the models given. In other words, the amount of information is the code length for encoding the models needed to encode the object n.
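The iterated-logarithm code length and the universal prior of [14] can be computed directly; a small sketch (the function names are ours):

```python
import math

C = 2.865  # the normalizing constant of [14]

def log_star(n):
    """L(n) = log2 n + log2 log2 n + ..., the sum ending with the last
    positive iterated logarithm value."""
    total, t = 0.0, float(n)
    while True:
        t = math.log2(t)
        if t <= 0:
            break
        total += t
    return total

def universal_prior(n):
    """P*(n) = (1/C) * 2^(-L(n))."""
    return 2.0 ** (-log_star(n)) / C

# e.g. log_star(16) = 4 + 2 + 1 = 7 bits, of which log2 16 = 4 bits encode
# the object and the remaining 3 bits encode the models - the "information"
```
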
2.3 Complexity and Universal Models
A universal model is a fundamental construct borrowed from coding theory and little known in ordinary statistics. Its roots, too, are in the algorithmic theory of complexity. In fact, it was the very reason for Solomonoff's quest for the shortest program length, because he wanted to have a universal prior for the integers as binary strings; see the section on Kolmogorov complexity below.
Given a class of models (2.1), we define a model f̂(y^n; k) to be universal for the class if

(1/n) log [ f(y^n; θ, k) / f̂(y^n; k) ] → 0, (2.3)

for all parameters θ ∈ Ω_k, and optimal universal if the convergence is the fastest possible for almost all θ; the convergence is in a probabilistic sense, either in the mean, taken with respect to f(y^n; θ, k), in probability, or almost surely. The mean sense is of particular interest, because we then have convergence in the Kullback–Leibler distance between the two density functions and hence consistency. The qualification "almost all θ" may be slightly modified; see [15, 16]. If we extend these definitions to the model class (2.2), we can talk about consistency in the number of parameters and ask again for the fastest convergence. In the literature, studies have been made of the weakest criteria under which consistency takes place [8]. Such criteria cannot measure consistency in terms of the Kullback–Leibler distance and do not seem to permit a quest for optimality.
2.3.1 The NML Universal Model
Let y^n → θ̂(y^n) be the Maximum Likelihood (ML) estimate, which minimizes the ideal code length log 1/f(y^n; θ, k) for a fixed k. Consider the maximized joint density or probability of the data in the minimized negative logarithm form

L(y^n, θ̂) = −log [ f(y^n; θ̂, k) / h(θ̂, θ̂) ] − log w(θ̂), (2.4)

where h(θ̂, θ) denotes the density function of the ML estimate and w(θ̂) a prior for the estimates, by an application of the MDL principle. It was originally obtained as the solution to Shtarkov's minmax problem
Trang 392 Model Selection and Testing by the MDL Principle 31
min_q max_{x^n} log [ f(x^n; θ̂(x^n), k) / q(x^n) ], (2.5)

with the solution, the NML model,

f̂(y^n; k) = f(y^n; θ̂(y^n), k) / C_{n,k}, C_{n,k} = ∫ f(x^n; θ̂(x^n), k) dx^n, (2.6)

as well as of the associated minmax problem

min_q max_g E_g log [ f(Y^n; θ̂(Y^n), k) / q(Y^n) ] = min_q max_g [ D(g ‖ q) − D(g ‖ f̂(·; k)) ] + log C_{n,k}, (2.7)

where g ranges over any set that includes f̂(y^n; k) – although the maximizing distribution is then nonunique.
For the special prior w(θ̂) the joint density function equals the marginal p(y^n) = ∫_{θ∈Ω_k} f(y^n; θ, k) w(θ) dθ, and the maximized posterior is given by the δ(θ, θ̂)-functional, whose integral over any subset of Ω_k of volume Δ is unity. This means that the posterior probability of the ML parameters, quantized to any precision, is unity. We might call it the canonical prior.
We have defined [17]

log 1/f̂(y^n; k) = log 1/f(y^n; θ̂(y^n), k) + log C_{n,k} (2.8)

as the stochastic complexity of the data y^n, given the class of models (2.1). If the model class satisfies the central limit theorem and other smoothness conditions, the stochastic complexity is given by the decomposition [17]

log [ f(y^n; θ̂(y^n), k) / f̂(y^n; k) ] = log C_{n,k} = (k/2) log (n/2π) + log ∫_{Ω_k} |I(θ)|^{1/2} dθ + o(1), (2.9)

where I(θ) denotes the Fisher information matrix.
There is a further asymptotic justification for stochastic complexity by the theorems in [15] and [16], which imply, among other things, that f̂(y^n; k) is optimal universal in the mean sense. This means that there is no density function which converges to the data-generating density function f(y^n; θ, k) faster than f̂(y^n; k). In particular, no other estimator y^n → θ̄(y^n) can give a smaller mean length for the normalized density function f(y^n; θ̄(y^n), k) than the ML estimator.
We mention in conclusion that for discrete data the integral in the normalizing coefficient becomes a sum, for which efficient algorithms have been developed for reasonable sizes of n in a number of special cases [9, 10, 24–27]. For such cases the structure of the models can be learned much better than by minimization of the asymptotic criterion (2.9).
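For the Bernoulli class (k = 1) the normalizing sum can be computed exactly and compared with the asymptotic expansion (2.9), for which the Fisher information I(θ) = 1/(θ(1−θ)) gives ∫_0^1 |I(θ)|^{1/2} dθ = π. The following sketch is our own (log base 2 throughout):

```python
import math

def log2_C_bernoulli(n):
    """Exact log2 C_{n,1}: the integral in (2.6) becomes the sum
    sum_j C(n,j) (j/n)^j ((n-j)/n)^(n-j) over the sufficient statistic j."""
    total = 0.0
    for j in range(n + 1):
        term = float(math.comb(n, j))
        if 0 < j < n:
            p = j / n
            term *= p ** j * (1 - p) ** (n - j)
        total += term
    return math.log2(total)

def log2_C_asymptotic(n):
    """(k/2) log2 (n / 2 pi) + log2 of the Fisher-information integral, k = 1."""
    return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

exact, approx = log2_C_bernoulli(100), log2_C_asymptotic(100)
```

Already at n = 100 the exact and asymptotic values agree to within about a tenth of a bit, which gives a feel for how sharp the expansion (2.9) is in this simple case.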
2.3.1.1 NML Model for Class M
The minimization of the stochastic complexity (2.8) with respect to k is meaningful both asymptotically and non-asymptotically. For instance, for discrete data log 1/f̂(y^n; n) can equal log C_{n,n}, which is larger than the minimized stochastic complexity log 1/f̂(y^n; k̂). This is a bit baffling, because log 1/f̂(y^n; k̂) is not a prefix code length; i.e., f̂(y^n; k̂) is not a model. To get a logical explanation, as well as to be able at least to define an optimal universal model for the class (2.2), let y^n → k̂(y^n) = k̂ denote the estimator that maximizes f̂(y^n; k), and hence minimizes the joint code length
L(y^n, k̂) = −log f̂(y^n; k̂(y^n)),

as well as the solution to the minmax problem

max_g min_q E_g log [ f̂(Y^n; k̂(Y^n)) / q(Y^n) ].

The difficulty is to calculate the probabilities g(k̂; k), but for our purpose here we do not need that, for we see that k̂, which minimizes the non-prefix code length log 1/f̂(y^n; k̂(y^n)), also minimizes the prefix code length log 1/f̂(y^n).
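As a toy instance of selecting k̂ (our own sketch, not from the chapter): compare a parameter-free fair-coin model, with code length n bits and no normalizer, against the one-parameter Bernoulli NML model. The one-parameter structure is selected only when the data are biased enough to pay for the extra log2 C_{n,1} bits.

```python
import math

def log2_C_bernoulli(n):
    """log2 of the NML normalizer for the Bernoulli class (exact sum)."""
    total = 0.0
    for j in range(n + 1):
        term = float(math.comb(n, j))
        if 0 < j < n:
            p = j / n
            term *= p ** j * (1 - p) ** (n - j)
        total += term
    return math.log2(total)

def stochastic_complexity(x, k):
    """log2 1/f^(y^n; k) for k = 0 (fair coin) or k = 1 (Bernoulli NML)."""
    n, ones = len(x), sum(x)
    if k == 0:
        return float(n)                 # -log2 2^-n, single model, C = 1
    nll = 0.0
    if 0 < ones < n:
        p = ones / n
        nll = -(ones * math.log2(p) + (n - ones) * math.log2(1 - p))
    return nll + log2_C_bernoulli(n)

k_biased = min((0, 1), key=lambda k: stochastic_complexity([1] * 18 + [0] * 2, k))
k_balanced = min((0, 1), key=lambda k: stochastic_complexity([1, 0] * 10, k))
```

For the 90%-ones string the maximized likelihood saves far more than log2 C_{20,1} costs, so k̂ = 1; for the balanced string the parameter buys nothing and k̂ = 0.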
Although f̂(y^n; k̂) provides an excellent criterion for the structure and the number of parameters in the usual cases where k is not too large in comparison with n, it is not good enough for denoising, where k is of the order of n. Then we cannot calculate h(θ̂, θ), C_{n,k}, nor the joint density accurately enough. One way in regression problems, which include denoising problems, is to approximate the maximum joint density by two-part coding thus:

log 1/f(y^n | X^n; θ̂^k, k) + L(k), (2.13)