MINISTRY OF NATIONAL DEFENSE
MILITARY TECHNICAL ACADEMY
NGUYEN THI HIEN
THE THESIS IS SUBMITTED TO MILITARY TECHNICAL ACADEMY -
MINISTRY OF NATIONAL DEFENSE
Academic supervisors:
1. Assoc. Prof. Dr. Nguyen Xuan Hoai
2. Prof. Dr. R.I. (Bob) McKay
Reviewer 1: Prof. Dr. Vu Duc Thi
Reviewer 2: Assoc. Prof. Dr. Nguyen Duc Nghia
Reviewer 3: Dr. Nguyen Van Vinh
The thesis was evaluated by the examination board of the Academy, established under Decision No. 1767/QĐ-HV dated 1st July 2014
of the Rector of Military Technical Academy, meeting at Military Technical Academy on day month year.
This thesis can be found at:
- Library of Military Technical Academy
- National Library of Vietnam
The genetic programming (GP) paradigm, first proposed by Koza (1992), can be viewed as the discovery of computer programs which produce desired outputs for particular inputs. Despite advances
in GP research, when it is applied to learning tasks, the issue
of generalization in GP has not received the attention that it deserves. This thesis focuses on the generalization aspect of GP and proposes mechanisms to improve GP generalization ability.
1 Motivations
GP has been proposed as a machine learning (ML) method. The main goal when using GP or other ML techniques is not only to create a program that exactly covers the training examples, but also to have good generalization ability. Although the issue of generalization in GP has recently received more attention,
the use of more traditional ML techniques in GP has been rather limited. It is hoped that adapting ML techniques to GP would help to improve GP generalization ability.
2 Research Perspectives
The majority of theoretical work has been derived from experimentation. The approach taken in this thesis is also based on carefully designed experiments and the analysis of experimental results.
3 Contributions of the Thesis
This thesis makes the following contributions:
1) An early stopping approach for GP that is based on an estimate of the generalization error, together with several
new criteria for early stopping in GP. The GP training process is stopped early at the point that promises to yield the solution with the smallest generalization error.
2) A progressive sampling framework for GP, which divides the GP learning process into several layers, with the training set size starting small and gradually growing at each layer. The termination of training on each layer is based on certain early stopping criteria.
4 Organization of the Thesis
Chapter 1 first introduces the basic components of GP as well as some benchmark problem domains. Two important research issues and a metaphor of GP search are then discussed. Next, the major issues of GP are discussed, especially when GP is considered as a ML system. It first overviews the approaches proposed in the literature that help to improve the generalization ability of GP. Then, the solution complexity (code bloat), in the particular context of GP, and its relations to GP learning performance are discussed. Chapter 2 provides background on a number of concepts and techniques subsequently used in the thesis for improving GP learning performance. They include progressive sampling, layer learning, and early stopping in ML. Chapter 3 presents one of the main contributions of the thesis. It proposes some criteria used to determine when to stop the training of GP with the aim of avoiding over-fitting. Chapter 4 develops a learning framework for GP that is based on layer learning and progressive sampling. Section 4.4 concludes the thesis, summarizing the main results and proposing suggestions for future work that extends the research in this thesis.
Chapter 1
BACKGROUND
This chapter describes the representation and the specific algorithm components used in the canonical version of GP. The chapter ends with a comprehensive overview of the literature on the approaches used to improve GP generalization ability.
1.1 Genetic Programming
The basic algorithm is as follows:
1) Initialise a population of solutions
2) Assign a fitness to each population member
3) While the stopping criterion is not met
4) Produce new individuals using operators and the existing population
5) Place new individuals into the population
6) Assign a fitness to each population member, and test for the stopping criterion
7) Return the best fitness found
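For concreteness, the loop above can be sketched in Python as follows; the helpers init_population, evaluate, select_parents, crossover, and mutate are hypothetical problem-specific callbacks rather than components defined in the thesis, and fitness is assumed to be an error to be minimised.

def run_gp(pop_size, max_generations, target_error,
           init_population, evaluate, select_parents, crossover, mutate):
    """Canonical GP loop; all helpers are problem-specific callbacks."""
    # 1) Initialise a population of solutions.
    population = init_population(pop_size)
    # 2) Assign a fitness to each population member.
    fitness = [evaluate(ind) for ind in population]
    generation = 0
    # 3) While the stopping criterion is not met ...
    while min(fitness) > target_error and generation < max_generations:
        new_population = []
        while len(new_population) < pop_size:
            # 4) Produce new individuals using operators and the existing population.
            parent1, parent2 = select_parents(population, fitness)
            new_population.append(mutate(crossover(parent1, parent2)))
        # 5) Place new individuals into the population.
        population = new_population
        # 6) Assign a fitness to each member and test the stopping criterion.
        fitness = [evaluate(ind) for ind in population]
        generation += 1
    # 7) Return the best solution found and its fitness.
    best_index = fitness.index(min(fitness))
    return population[best_index], fitness[best_index]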
1.1.1 Representation, Initialization and Operators in GP
1.1.2 Representation
GP programs are expressed as expression trees. The variables and constants in the program are called terminals in GP, which are the leaves of the tree. The arithmetic operations are internal nodes (called functions in the GP literature). The sets of allowed functions and terminals together form the primitive symbol set of a GP system.
1.1.3 Initializing the Population
The ramped half-and-half method is the one most commonly used to perform tree initialisation. It was introduced by Koza and probabilistically selects between two recursive tree-generating methods: Grow and Full.
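A sketch of the two generating methods and the depth ramp, assuming a nested-tuple tree representation; the function and terminal sets below are illustrative placeholders, not the primitive set used in the thesis.

import random

FUNCTIONS = {'+': 2, '-': 2, '*': 2}   # function symbol -> arity
TERMINALS = ['x', 1.0]                 # variables and constants

def full_tree(depth):
    """Full method: every branch reaches the maximum depth."""
    if depth == 0:
        return random.choice(TERMINALS)
    f = random.choice(list(FUNCTIONS))
    return (f,) + tuple(full_tree(depth - 1) for _ in range(FUNCTIONS[f]))

def grow_tree(depth):
    """Grow method: a branch may stop early by picking a terminal."""
    terminal_prob = len(TERMINALS) / (len(TERMINALS) + len(FUNCTIONS))
    if depth == 0 or random.random() < terminal_prob:
        return random.choice(TERMINALS)
    f = random.choice(list(FUNCTIONS))
    return (f,) + tuple(grow_tree(depth - 1) for _ in range(FUNCTIONS[f]))

def ramped_half_and_half(pop_size, min_depth=2, max_depth=6):
    """Depths are ramped over [min_depth, max_depth]; half Grow, half Full."""
    population = []
    for i in range(pop_size):
        depth = min_depth + i % (max_depth - min_depth + 1)
        method = full_tree if i % 2 == 0 else grow_tree
        population.append(method(depth))
    return population

print(ramped_half_and_half(4))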
1.1.4 Fitness and Selection
Fitness is the measure used by GP to indicate how well a program has learned to predict the output(s) from the input(s). The error fitness function and the squared error fitness function are commonly used in GP. The most commonly employed method for selecting individuals in GP is tournament selection.
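As an illustration, a squared-error fitness and tournament selection might be sketched as follows; eval_program, which evaluates an expression tree on a single input, is an assumed helper.

import random

def squared_error_fitness(program, cases, eval_program):
    """Sum of squared errors over the training cases (lower is better)."""
    return sum((eval_program(program, x) - y) ** 2 for x, y in cases)

def tournament_select(population, fitnesses, tournament_size=3):
    """Return the best of `tournament_size` randomly sampled individuals."""
    contenders = random.sample(range(len(population)), tournament_size)
    winner = min(contenders, key=lambda i: fitnesses[i])
    return population[winner]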
1.1.5 Recombination
• Choose two individuals as parents, based on the reproduction selection strategy.
• Select a random subtree in each parent. Subtrees consisting only of terminals are selected with lower probability than other subtrees.
• Swap the selected subtrees between the two parents. The resulting individuals are the children.
1.1.6 Mutation
An individual is selected for mutation using fitness-proportional selection. A single function or terminal is selected at random from among the set of functions and terminals making up the original individual as the point of mutation. The mutation point, along with the subtree stemming from the mutation point, is then removed from the tree and replaced with a new, randomly generated subtree.
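Subtree crossover and subtree mutation can be sketched on a nested-list tree representation as below. The 90/10 internal-node/leaf selection bias is a commonly used setting rather than one stated in this summary, and random_tree stands for any generator of fresh random subtrees (e.g. the Grow method above).

import copy, random

# Trees are nested lists, e.g. ['+', 'x', ['*', 'x', 'x']]; leaves are terminals.

def node_paths(tree, path=()):
    """Return the index-paths of all nodes in the tree."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(node_paths(child, path + (i,)))
    return paths

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    if not path:
        return new                      # replacing the root
    get_subtree(tree, path[:-1])[path[-1]] = new
    return tree

def pick_point(tree, leaf_prob=0.1):
    """Prefer internal nodes; terminals are chosen with lower probability."""
    paths = node_paths(tree)
    leaves = [p for p in paths if not isinstance(get_subtree(tree, p), list)]
    internals = [p for p in paths if isinstance(get_subtree(tree, p), list)]
    if internals and random.random() > leaf_prob:
        return random.choice(internals)
    return random.choice(leaves) if leaves else random.choice(paths)

def subtree_crossover(parent1, parent2):
    """Swap randomly chosen subtrees between deep copies of the parents."""
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    p1, p2 = pick_point(c1), pick_point(c2)
    s1, s2 = copy.deepcopy(get_subtree(c1, p1)), copy.deepcopy(get_subtree(c2, p2))
    return set_subtree(c1, p1, s2), set_subtree(c2, p2, s1)

def subtree_mutation(parent, random_tree):
    """Replace a randomly chosen subtree with a freshly generated one."""
    child = copy.deepcopy(parent)
    return set_subtree(child, pick_point(child), random_tree())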
1.1.9 Some Variants of GP
The GP community has proposed numerous different approaches to program evolution: Linear GP, Graph-Based GP, and Grammar-Based GP.
1.2 An Example of Problem Domain
This thesis uses the 10 polynomials in Table 1.1 with white (Gaussian) noise of standard deviation 0.01; that is, each target function is F(x) = f(x) + ε. Four more real-world data sets, from the UCI ML repository and StatLib, are also used, as shown in Table 1.2.
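A sketch of how such noisy training cases could be generated; the quartic polynomial used here is a standard symbolic-regression benchmark chosen only for illustration, since Table 1.1 is not reproduced in this summary.

import numpy as np

def make_dataset(f, n_points, low=-1.0, high=1.0, noise_std=0.01, seed=0):
    """Sample inputs uniformly and add white Gaussian noise to the target."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, size=n_points)
    y = f(x) + rng.normal(0.0, noise_std, size=n_points)   # F(x) = f(x) + eps
    return x, y

# Illustrative target function (not necessarily one of the ten in Table 1.1).
x_train, y_train = make_dataset(lambda x: x**4 + x**3 + x**2 + x, n_points=100)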
1.3 GP and Machine Learning Issues
This section explores some open questions in GP research from a ML perspective, which motivate the research in this thesis.
Table 1.1: Test Functions
The complexity of a GP individual can be measured by its depth, its size, or both (depth and size).
1.3.2 Bloat
In the course of evolution, the average size of the individuals in the GP population often increases substantially. Typically, the increase in program size is not accompanied by any corresponding increase in fitness. The origin of this phenomenon, which is known as bloat, has effectively been a subject of research for over a decade.
1.3.3 Generalization and Complexity
Most research on improving the generalization ability of GP has concentrated on avoiding over-fitting on the training data. One of the main causes of over-fitting has been identified as the "complexity" of the hypothesis generated by the learning algorithm. When GP trees grow to fit or to specialize on "difficult" learning cases, they no longer have the ability to generalize further; this is entirely consistent with the principle of Occam's razor, which states that simpler solutions should always be preferred. The relationship between generalization and individual complexity in GP has often been studied in the context of code bloat, the extraordinary enlargement of the complexity of solutions without a corresponding increase in their fitness.
1.4 Generalization in GP
1.4.1 Overview of Studies on GP Generalization Capability
Common to most of the research are the problems with obtaining generalization in GP and the attempts to overcome these problems. These approaches can be categorized as follows:
• Using separate training and testing sets in order to promote generalization in supervised learning problems.
• Changing training instances from one generation to another and evaluating performance on subsets of training instances, or combining GP with ensemble learning techniques.
• Changing the formal implementation of GP: the representation, genetic operators, and selection strategy.
1.4.2 Problem of Improving Training Efficiency
To derive and evaluate a model which uses GP as the core learning component, we seek to answer the following questions:
1) Performance: How sensitive is the model to changes in the learning problem or in the initial conditions?
2) The effective size of the solution (complexity): the principle of Occam's razor states that, among equally good solutions, the simpler one is preferable.
3) Training time: How long will the learning take, i.e., how fast or slow is the training phase?
Chapter 2
EARLY STOPPING, PROGRESSIVE SAMPLING,
AND LAYERED LEARNING
Early stopping, progressive sampling, and layered learning are techniques for enhancing learning performance in ML. They have been used in combination with a number of ML techniques.
2.1 Over-training, Early Stopping and Regularization
The over-training problem is common in the field of ML. This problem is related to the learning process of learning machines. Over-training has been an open topic of discussion, motivating the proposal of several techniques such as regularization and early stopping.
2.1.1 Over-training
When the number of training samples is infinitely large and they are unbiased, the ML system parameters converge to one of the local minima of the specified risk function (expected loss). When the number of training samples is finite, the true risk function is different from the empirical risk function to be minimized. Thus, since the training samples are biased, the parameters of a learning machine might converge to a biased solution. This is known as over-fitting or over-training in ML.
2.1.2 Early Stopping
During the training phase of a learning machine, the generalization error might decrease in an early period, reach a minimum, and then increase as the training goes on, while the training error monotonically decreases. Therefore, it is considered better to stop training at an adequate time; the class of techniques based on this observation is referred to as early stopping.
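A generic hold-out early-stopping loop of this kind might look as follows. This is not the thesis's GP-specific criteria (those are introduced in Chapter 3); train_one_generation and validation_error are assumed callbacks, and the patience window is an illustrative choice.

def train_with_early_stopping(train_one_generation, validation_error,
                              max_generations, patience=10):
    """Stop when the validation error has not improved for `patience` generations."""
    best_val, best_model, since_best = float('inf'), None, 0
    for g in range(max_generations):
        model = train_one_generation(g)      # one generation (or epoch) of training
        val = validation_error(model)
        if val < best_val:                   # validation error still improving
            best_val, best_model, since_best = val, model, 0
        else:
            since_best += 1
        if since_best >= patience:           # early stop
            break
    return best_model, best_val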
2.2 Progressive Sampling
PS is a family of training techniques in traditional machine learning for dealing with large training data sets. As Provost et al. (1999) describe PS, a learning algorithm is trained and retrained with increasingly larger random samples until the accuracy (generalization) of the learnt model stops improving. For a training data set of size N, the task of PS is to determine a sampling schedule S = {n0, n1, n2, ..., nk}, where ni is the size of the sample to be provided to the learning algorithm at stage i; it satisfies i < j ⇒ ni < nj and ni ≤ N, ∀i, j ∈ {0, 1, ..., k}. A learning machine M is then trained according to the sampling schedule S. Generally, these factors depend on the learning problem and the algorithm. Provost et al. investigated a number of sampling schedules, such as the static schedule, the simple schedule, and geometric schedules; geometric schedules were formally and experimentally shown to be efficient for learning algorithms of polynomial time complexity.
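A geometric schedule of the kind Provost et al. found efficient can be sketched as follows; the starting size n0 and ratio a are assumed parameters, not values prescribed in the thesis.

def geometric_schedule(n0, a, N):
    """Sample sizes n_i = n0 * a**i, strictly increasing and capped at N."""
    schedule, n = [], n0
    while n < N:
        schedule.append(int(n))
        n *= a
    schedule.append(N)          # final stage uses the full training set
    return schedule

print(geometric_schedule(100, 2, 5000))   # [100, 200, 400, 800, 1600, 3200, 5000]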
2.3 Layered Learning
The layered learning (LL) paradigm was described formally by Stone and Veloso (2000). Intended as a means for dealing with problems for which finding a direct mapping from inputs to outputs is intractable with existing learning algorithms, the essence of the approach is to decompose a problem into subtasks, each of which is then associated with a layer in the problem-solving process. LL is a ML paradigm defined as a set of principles for the construction of a hierarchical, learned solution to a complex task.
3.1 Method
Before the training (evolutionary run) commences, the entire data set is randomly divided into three subsets: a training set (50%), a validation set (25%), and a test set (25%).
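A minimal sketch of this random 50/25/25 split, assuming the data set is given as a list of (input, output) cases.

import random

def split_dataset(cases, seed=0):
    """Randomly split the cases into 50% training, 25% validation, 25% test."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    n_train, n_val = len(cases) // 2, len(cases) // 4
    return (cases[:n_train],                      # training set
            cases[n_train:n_train + n_val],       # validation set
            cases[n_train + n_val:])              # test set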
3.1.1 Proposed Classes of Stopping Criteria
To detect over-fitting during a run of GP, the amount of over-fitting is calculated as proposed in Vanneschi's paper (2010). The detail of the method is presented in Algorithm 1 below. In the algorithm, bvp is the "best validation point" and stands for the best validation fitness (error on the validation set) reached until the current generation of a GP run, excluding the generations (usually at the beginning of the run) where the best individual on the training set has a better validation fitness value than its training fitness; tbtp stands for "training at best validation point", i.e. the training fitness of the individual whose validation fitness equals bvp; Training_Fit(g) is a function that returns the best training fitness in the population at generation g; Val_Fit(g) returns the validation fitness of the best individual on the training set at generation g.
Algorithm 1: Measuring the over-fitting for GP systems.
over_fit(0) = 0; bvp = Val_Fit(0);
over_fit(g) = |Training_Fit(g) − Val_Fit(g)| − |tbtp − bvp|;
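Algorithm 1 is reproduced above only in abridged form. The sketch below fills in the bookkeeping for bvp and tbtp in the way the measure is described by Vanneschi et al. (2010); it should be read as a plausible reconstruction under the assumption that fitness is an error to be minimised, not as the thesis's exact pseudo-code.

def overfit_measure(training_fit, val_fit):
    """over_fit(g) for each generation g, given the per-generation best training
    and validation fitness values (errors, lower is better)."""
    over_fit = [0.0]
    bvp = val_fit[0]                 # best validation point so far
    tbtp = training_fit[0]           # training fitness at that point
    for g in range(1, len(training_fit)):
        if training_fit[g] > val_fit[g]:
            # validation better than training: no over-fitting at this generation
            over_fit.append(0.0)
        elif val_fit[g] < bvp:
            # new best validation fitness: update the reference point
            bvp, tbtp = val_fit[g], training_fit[g]
            over_fit.append(0.0)
        else:
            over_fit.append(abs(training_fit[g] - val_fit[g]) - abs(tbtp - bvp))
    return over_fit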
The first class of stopping criteria for an early stopping of a GP run is satisfied if over-fitting has been detected in m successive generations, where m is a tunable parameter.
OF_m: stop after generation g when over-fitting happened in m successive generations up to g, i.e.
overfit(g), overfit(g − 1), ..., overfit(g − m) > 0
However, the above criterion only considers whether or not the over-fitting phenomenon occurs, without taking its value into account. This leads us to the second class of stopping criteria: stop when the over-fitting value does not decrease in m successive generations.
VF_m: stop when the over-fitting value has not decreased for m successive generations, i.e.
overfit(g) ≥ overfit(g − 1) ≥ ... ≥ overfit(g − m) > 0
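Given the per-generation over_fit values, the two classes of criteria can be checked as sketched below; the convention that the window covers generations g − m through g follows the inequalities as written above, and the caller is assumed to have over_fit values up to generation g.

def of_m_triggered(over_fit, g, m):
    """OF_m: over-fitting occurred (value > 0) in every generation from g - m to g."""
    return g >= m and all(over_fit[i] > 0 for i in range(g - m, g + 1))

def vf_m_triggered(over_fit, g, m):
    """VF_m: the over-fit value has not decreased from generation g - m to g."""
    return (g >= m and over_fit[g - m] > 0 and
            all(over_fit[i] >= over_fit[i - 1] for i in range(g - m + 1, g + 1)))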
In the early generations, the desire for early stopping should be smaller than in the late generations. Therefore, we also propose a variant of the second stopping criterion, called the true adaptive stopping criterion, which works as in Algorithm 2 (where d is the number of generations whose over-fit values are not decreasing, fg is the generation that started the chain of non-decreasing over-fit values, lg is the last generation of that chain, random is a random value in [0, 1], g is the current generation, and MaxGen is the maximum number of generations).
Algorithm 2: True Adaptive Stopping
During a run, individuals are never evaluated on the test set, so the validation error is used as an estimate of the generalization error. The third class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold. The class GL_α is defined as follows.
GL_α: stop after the first generation at which the generalization loss exceeds α.
However, when the training error is still decreasing rapidly, generalization losses have a higher chance to be "repaired"; thus we assume that over-fitting does not begin until the error decreases only slowly. The training progress is measured over a training strip of length k, defined as a sequence of k generations numbered g + 1, ..., g + k where g is divisible by k. It measures how much the average training error during the strip is larger than the minimum training error during that strip.
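For reference, in Prechelt's formulation of early stopping criteria, which the generalization loss and training-strip notions above appear to follow (an assumption of this summary), the two quantities are usually written as
GL(g) = 100 · (E_va(g) / E_opt(g) − 1), where E_opt(g) = min_{g' ≤ g} E_va(g'),
P_k(g) = 1000 · ( Σ_{g' = g+1..g+k} E_tr(g') / (k · min_{g' = g+1..g+k} E_tr(g')) − 1 ),
with E_va and E_tr denoting the validation and training errors; GL_α then stops after the first generation with GL(g) > α.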