MINISTRY OF NATIONAL DEFENSE
MILITARY TECHNICAL ACADEMY
NGUYEN THI HIEN
THE THESIS IS SUBMITTED TO MILITARY TECHNICAL ACADEMY -
MINISTRY OF NATIONAL DEFENSE
Academic supervisors:
1. Assoc. Prof. Dr. Nguyen Xuan Hoai
2. Prof. Dr. R.I. (Bob) McKay
Reviewer 1: Prof. Dr. Vu Duc Thi
Reviewer 2: Assoc. Prof. Dr. Nguyen Duc Nghia
Reviewer 3: Dr. Nguyen Van Vinh
The thesis was evaluated by the examination board of the Academy, established under Decision No. 1767/QĐ-HV dated 1st July 2014
of the Rector of Military Technical Academy, meeting at Military Technical Academy on day month year.
This thesis can be found at:
- Library of Military Technical Academy
- National Library of Vietnam
The genetic programming (GP) paradigm, first proposed by Koza (1992), can be viewed as the discovery of computer programs which produce desired outputs for particular inputs. Despite advances
in GP research, when it is applied to learning tasks, the issue
of generalization in GP has not received the attention that it deserves. This thesis focuses on the generalization aspect of GP and proposes mechanisms to improve GP generalization ability.
1 Motivations
GP has been proposed as a machine learning (ML) method. The main goal when using GP or other ML techniques is not only to create a program that exactly covers the training examples, but also to have good generalization ability. Although the issue of generalization in GP has recently received more attention,
the use of more traditional ML techniques in GP has been rather limited. It is hoped that adapting ML techniques to GP would help to improve GP generalization ability.
2 Research Perspectives
The majority of theoretical work has been derived from experimentation. The approach taken in this thesis is also based on carefully designed experiments and the analysis of experimental results.
3 Contributions of the Thesis
This thesis makes the following contributions:
1) An early stopping approach for GP that is based on an estimate of the generalization error, together with several
new criteria for early stopping in GP. The GP training process is stopped early at the point that promises to yield the solution with the smallest generalization error.
2) A progressive sampling framework for GP, which divides the GP learning process into several layers, with the training set size starting small and gradually growing at each layer. The termination of training on each layer is based on certain early stopping criteria.
4 Organization of the Thesis
Chapter 1 first introduces the basic components of GP as well as some benchmark problem domains. Two important research issues and a metaphor of GP search are then discussed. Next, the major issues of GP are discussed, especially when GP is considered as a ML system. It first overviews the approaches proposed in the literature that help to improve the generalization ability of GP. Then, the solution complexity (code bloat), in the particular context of GP, and its relations to GP learning performance are discussed. Chapter 2 provides background on a number of concepts and techniques subsequently used in the thesis for improving GP learning performance. They include progressive sampling, layer learning, and early stopping in ML. Chapter 3 presents one of the main contributions of the thesis. It proposes some criteria used to determine when to stop the training of GP with the aim of avoiding over-fitting. Chapter 4 develops a learning framework for GP that is based on layer learning and progressive sampling. Section 4.4 concludes the thesis, summarizing the main results and proposing suggestions for future work that extends the research in this thesis.
Chapter 1
BACKGROUND
This chapter describes the representation and the specific algorithm components used in the canonical version of GP. The chapter ends with a comprehensive overview of the literature on the approaches used to improve GP generalization ability.
1.1 Genetic Programming
The basic algorithm is as follows:
1) Initialise a population of solutions
2) Assign a fitness to each population member
3) While the stopping criterion is not met
4) Produce new individuals using operators and the existing population
5) Place new individuals into the population
6) Assign a fitness to each population member, and test for the stopping criterion
7) Return the best fitness found
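For concreteness, the loop above can be sketched in Python as follows; the helpers init_population, evaluate, select_parents, crossover, and mutate are hypothetical problem-specific callbacks rather than components defined in the thesis, and fitness is assumed to be an error to be minimised.

def run_gp(pop_size, max_generations, target_error,
           init_population, evaluate, select_parents, crossover, mutate):
    """Canonical GP loop; all helpers are problem-specific callbacks."""
    # 1) Initialise a population of solutions.
    population = init_population(pop_size)
    # 2) Assign a fitness to each population member.
    fitness = [evaluate(ind) for ind in population]
    generation = 0
    # 3) While the stopping criterion is not met ...
    while min(fitness) > target_error and generation < max_generations:
        new_population = []
        while len(new_population) < pop_size:
            # 4) Produce new individuals using operators and the existing population.
            parent1, parent2 = select_parents(population, fitness)
            new_population.append(mutate(crossover(parent1, parent2)))
        # 5) Place new individuals into the population.
        population = new_population
        # 6) Assign a fitness to each member and test the stopping criterion.
        fitness = [evaluate(ind) for ind in population]
        generation += 1
    # 7) Return the best solution found and its fitness.
    best_index = fitness.index(min(fitness))
    return population[best_index], fitness[best_index]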
1.1.1 Representation, Initialization and Operators in GP
1.1.2 Representation
GP programs are expressed as expression trees. The variables and constants in the program are called terminals in GP, which are the leaves of the tree. The arithmetic operations are internal nodes (called functions in the GP literature). The sets of allowed functions and terminals together form the primitive symbol set of a GP system.
1.1.3 Initializing the Population
The ramped half-and-half method is the one most commonly used to perform tree initialisation. It was introduced by Koza and probabilistically selects between two recursive tree-generating methods: Grow and Full.
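A sketch of the two generating methods and the depth ramp, assuming a nested-tuple tree representation; the function and terminal sets below are illustrative placeholders, not the primitive set used in the thesis.

import random

FUNCTIONS = {'+': 2, '-': 2, '*': 2}   # function symbol -> arity
TERMINALS = ['x', 1.0]                 # variables and constants

def full_tree(depth):
    """Full method: every branch reaches the maximum depth."""
    if depth == 0:
        return random.choice(TERMINALS)
    f = random.choice(list(FUNCTIONS))
    return (f,) + tuple(full_tree(depth - 1) for _ in range(FUNCTIONS[f]))

def grow_tree(depth):
    """Grow method: a branch may stop early by picking a terminal."""
    terminal_prob = len(TERMINALS) / (len(TERMINALS) + len(FUNCTIONS))
    if depth == 0 or random.random() < terminal_prob:
        return random.choice(TERMINALS)
    f = random.choice(list(FUNCTIONS))
    return (f,) + tuple(grow_tree(depth - 1) for _ in range(FUNCTIONS[f]))

def ramped_half_and_half(pop_size, min_depth=2, max_depth=6):
    """Depths are ramped over [min_depth, max_depth]; half Grow, half Full."""
    population = []
    for i in range(pop_size):
        depth = min_depth + i % (max_depth - min_depth + 1)
        method = full_tree if i % 2 == 0 else grow_tree
        population.append(method(depth))
    return population

print(ramped_half_and_half(4))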
1.1.4 Fitness and Selection
Fitness is the measure used by GP to indicate how well a program has learned to predict the output(s) from the input(s). The error fitness function and the squared error fitness function are commonly used in GP. The most commonly employed method for selecting individuals in GP is tournament selection.
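As an illustration, a squared-error fitness and tournament selection might be sketched as follows; eval_program, which evaluates an expression tree on a single input, is an assumed helper.

import random

def squared_error_fitness(program, cases, eval_program):
    """Sum of squared errors over the training cases (lower is better)."""
    return sum((eval_program(program, x) - y) ** 2 for x, y in cases)

def tournament_select(population, fitnesses, tournament_size=3):
    """Return the best of `tournament_size` randomly sampled individuals."""
    contenders = random.sample(range(len(population)), tournament_size)
    winner = min(contenders, key=lambda i: fitnesses[i])
    return population[winner]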
1.1.5 Recombination
• Choose two individuals as parents, based on the reproduction selection strategy.
• Select a random subtree in each parent. Subtrees consisting only of terminals are selected with lower probability than other subtrees.
• Swap the selected subtrees between the two parents. The resulting individuals are the children.
1.1.6 Mutation
An individual is selected for mutation using fitness-proportional selection. A single function or terminal is selected at random from among the set of functions and terminals making up the original individual as the point of mutation. The mutation point, along with the subtree stemming from the mutation point, is then removed from the tree and replaced with a new, randomly generated subtree.
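Subtree crossover and subtree mutation can be sketched on a nested-list tree representation as below. The 90/10 internal-node/leaf selection bias is a commonly used setting rather than one stated in this summary, and random_tree stands for any generator of fresh random subtrees (e.g. the Grow method above).

import copy, random

# Trees are nested lists, e.g. ['+', 'x', ['*', 'x', 'x']]; leaves are terminals.

def node_paths(tree, path=()):
    """Return the index-paths of all nodes in the tree."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(node_paths(child, path + (i,)))
    return paths

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    if not path:
        return new                      # replacing the root
    get_subtree(tree, path[:-1])[path[-1]] = new
    return tree

def pick_point(tree, leaf_prob=0.1):
    """Prefer internal nodes; terminals are chosen with lower probability."""
    paths = node_paths(tree)
    leaves = [p for p in paths if not isinstance(get_subtree(tree, p), list)]
    internals = [p for p in paths if isinstance(get_subtree(tree, p), list)]
    if internals and random.random() > leaf_prob:
        return random.choice(internals)
    return random.choice(leaves) if leaves else random.choice(paths)

def subtree_crossover(parent1, parent2):
    """Swap randomly chosen subtrees between deep copies of the parents."""
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    p1, p2 = pick_point(c1), pick_point(c2)
    s1, s2 = copy.deepcopy(get_subtree(c1, p1)), copy.deepcopy(get_subtree(c2, p2))
    return set_subtree(c1, p1, s2), set_subtree(c2, p2, s1)

def subtree_mutation(parent, random_tree):
    """Replace a randomly chosen subtree with a freshly generated one."""
    child = copy.deepcopy(parent)
    return set_subtree(child, pick_point(child), random_tree())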
1.1.9 Some Variants of GP
The GP community has proposed numerous different approaches to program evolution: Linear GP, Graph-Based GP, and Grammar-Based GP.
1.2 An Example of Problem Domain
This thesis uses the 10 polynomials in Table 1.1 with white (Gaussian) noise of standard deviation 0.01; that is, each target function is F(x) = f(x) + ε. Four more real-world data sets, from the UCI ML repository and StatLib, are also used, as shown in Table 1.2.
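A sketch of how such noisy training cases could be generated; the quartic polynomial used here is a standard symbolic-regression benchmark chosen only for illustration, since Table 1.1 is not reproduced in this summary.

import numpy as np

def make_dataset(f, n_points, low=-1.0, high=1.0, noise_std=0.01, seed=0):
    """Sample inputs uniformly and add white Gaussian noise to the target."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, size=n_points)
    y = f(x) + rng.normal(0.0, noise_std, size=n_points)   # F(x) = f(x) + eps
    return x, y

# Illustrative target function (not necessarily one of the ten in Table 1.1).
x_train, y_train = make_dataset(lambda x: x**4 + x**3 + x**2 + x, n_points=100)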
1.3 GP and Machine Learning Issues
This section explores some open questions in GP research from a ML perspective, which motivate the research in this thesis.
Table 1.1: Test Functions
The complexity of a GP individual can be measured by its depth, its size, or both (depth and size).
1.3.2 Bloat
In the course of evolution, the average size of the individuals in the GP population often increases substantially. Typically, the increase in program size is not accompanied by any corresponding increase in fitness. The origin of this phenomenon, which is known as bloat, has effectively been a subject of research for over a decade.
1.3.3 Generalization and Complexity
Most research on improving the generalization ability of GP has concentrated on avoiding over-fitting on the training data. One of the main causes of over-fitting has been identified as the "complexity" of the hypothesis generated by the learning algorithm. When GP trees grow to fit or to specialize on "difficult" learning cases, they no longer have the ability to generalize further; this is entirely consistent with the principle of Occam's razor, which states that simpler solutions should always be preferred. The relationship between generalization and individual complexity in GP has often been studied in the context of code bloat, the extraordinary enlargement of the complexity of solutions without a corresponding increase in their fitness.
1.4 Generalization in GP
1.4.1 Overview of Studies on GP Generalization Capability
Common to most of the research are the problems with obtaining generalization in GP and the attempts to overcome these problems. These approaches can be categorized as follows:
• Using separate training and testing sets in order to promote generalization in supervised learning problems.
• Changing training instances from one generation to another and evaluating performance on subsets of training instances, or combining GP with ensemble learning techniques.
• Changing the formal implementation of GP: the representation, genetic operators, and selection strategy.
1.4.2 Problem of Improving Training Efficiency
To derive and evaluate a model which uses GP as the core learning component, we seek to answer the following questions:
1) Performance: How sensitive is the model to changes in the learning problem or in the initial conditions?
2) The effective size of the solution (complexity): the principle of Occam's razor states that, among equally good solutions, the simpler one is preferable.
3) Training time: How long will the learning take, i.e., how fast or slow is the training phase?
Chapter 2
EARLY STOPPING, PROGRESSIVE SAMPLING,
AND LAYERED LEARNING
Early stopping, progressive sampling, and layered learning are techniques for enhancing learning performance in ML. They have been used in combination with a number of ML techniques.
2.1 Over-training, Early Stopping and Regularization
The over-training problem is common in the field of ML. This problem is related to the learning process of learning machines. Over-training has been an open topic of discussion, motivating the proposal of several techniques such as regularization and early stopping.
2.1.1 Over-training
When the number of training samples is infinitely large and they are unbiased, the ML system parameters converge to one of the local minima of the specified risk function (expected loss). When the number of training samples is finite, the true risk function is different from the empirical risk function to be minimized. Thus, since the training samples are biased, the parameters of a learning machine might converge to a biased solution. This is known as over-fitting or over-training in ML.
2.1.2 Early Stopping
During the training phase of a learning machine, the generalization error might decrease in an early period, reach a minimum, and then increase as the training goes on, while the training error monotonically decreases. Therefore, it is considered better to stop training at an adequate time; the class of techniques based on this observation is referred to as early stopping.
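A generic hold-out early-stopping loop of this kind might look as follows. This is not the thesis's GP-specific criteria (those are introduced in Chapter 3); train_one_generation and validation_error are assumed callbacks, and the patience window is an illustrative choice.

def train_with_early_stopping(train_one_generation, validation_error,
                              max_generations, patience=10):
    """Stop when the validation error has not improved for `patience` generations."""
    best_val, best_model, since_best = float('inf'), None, 0
    for g in range(max_generations):
        model = train_one_generation(g)      # one generation (or epoch) of training
        val = validation_error(model)
        if val < best_val:                   # validation error still improving
            best_val, best_model, since_best = val, model, 0
        else:
            since_best += 1
        if since_best >= patience:           # early stop
            break
    return best_model, best_val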
2.2 Progressive Sampling
PS is a family of training techniques in traditional machine learning for dealing with large training data sets. As Provost et al. (1999) describe PS, a learning algorithm is trained and retrained with increasingly larger random samples until the accuracy (generalization) of the learnt model stops improving. For a training data set of size N, the task of PS is to determine a sampling schedule S = {n0, n1, n2, ..., nk}, where ni is the size of the sample to be provided to the learning algorithm at stage i; it satisfies i < j ⇒ ni < nj and ni ≤ N, ∀i, j ∈ {0, 1, ..., k}. A learning machine M is then trained according to the sampling schedule S. Generally, these factors depend on the learning problem and the algorithm. Provost et al. investigated a number of sampling schedules, such as the static schedule, the simple schedule, and geometric schedules; geometric schedules were formally and experimentally shown to be efficient for learning algorithms of polynomial time complexity.
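A geometric schedule of the kind Provost et al. found efficient can be sketched as follows; the starting size n0 and ratio a are assumed parameters, not values prescribed in the thesis.

def geometric_schedule(n0, a, N):
    """Sample sizes n_i = n0 * a**i, strictly increasing and capped at N."""
    schedule, n = [], n0
    while n < N:
        schedule.append(int(n))
        n *= a
    schedule.append(N)          # final stage uses the full training set
    return schedule

print(geometric_schedule(100, 2, 5000))   # [100, 200, 400, 800, 1600, 3200, 5000]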
2.3 Layered Learning
The layered learning (LL) paradigm was described formally by Stone and Veloso (2000). Intended as a means for dealing with problems for which finding a direct mapping from inputs to outputs is intractable with existing learning algorithms, the essence of the approach is to decompose a problem into subtasks, each of which is then associated with a layer in the problem-solving process. LL is a ML paradigm defined as a set of principles for the construction of a hierarchical, learned solution to a complex task.
3.1 Method
Before the training (evolutionary run) commences, the entire data set is randomly divided into three subsets: a training set (50%), a validation set (25%), and a test set (25%).
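A minimal sketch of this random 50/25/25 split, assuming the data set is given as a list of (input, output) cases.

import random

def split_dataset(cases, seed=0):
    """Randomly split the cases into 50% training, 25% validation, 25% test."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    n_train, n_val = len(cases) // 2, len(cases) // 4
    return (cases[:n_train],                      # training set
            cases[n_train:n_train + n_val],       # validation set
            cases[n_train + n_val:])              # test set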
3.1.1 Proposed Classes of Stopping Criteria
To detect over-fitting during a run of GP, the amount of over-fitting is calculated as proposed in Vanneschi's paper (2010). The detail of the method is presented in Algorithm 1 below. In the algorithm, bvp is the "best validation point" and stands for the best validation fitness (error on the validation set) reached until the current generation of a GP run, excluding the generations (usually at the beginning of the run) where the best individual on the training set has a better validation fitness value than its training fitness; tbtp stands for "training at best validation point", i.e. the training fitness of the individual whose validation fitness equals bvp; Training_Fit(g) is a function that returns the best training fitness in the population at generation g; Val_Fit(g) returns the validation fitness of the best individual on the training set at generation g.
Algorithm 1: Measuring the over-fitting for GP systems.
over_fit(0) = 0; bvp = Val_Fit(0);
over_fit(g) = |Training_Fit(g) − Val_Fit(g)| − |tbtp − bvp|;
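Algorithm 1 is reproduced above only in abridged form. The sketch below fills in the bookkeeping for bvp and tbtp in the way the measure is described by Vanneschi et al. (2010); it should be read as a plausible reconstruction under the assumption that fitness is an error to be minimised, not as the thesis's exact pseudo-code.

def overfit_measure(training_fit, val_fit):
    """over_fit(g) for each generation g, given the per-generation best training
    and validation fitness values (errors, lower is better)."""
    over_fit = [0.0]
    bvp = val_fit[0]                 # best validation point so far
    tbtp = training_fit[0]           # training fitness at that point
    for g in range(1, len(training_fit)):
        if training_fit[g] > val_fit[g]:
            # validation better than training: no over-fitting at this generation
            over_fit.append(0.0)
        elif val_fit[g] < bvp:
            # new best validation fitness: update the reference point
            bvp, tbtp = val_fit[g], training_fit[g]
            over_fit.append(0.0)
        else:
            over_fit.append(abs(training_fit[g] - val_fit[g]) - abs(tbtp - bvp))
    return over_fit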
The first class of stopping criteria for an early stopping of a GP run is satisfied if over-fitting has been detected in m successive generations, where m is a tunable parameter.
OF_m: stop after generation g when over-fitting happened in m successive generations up to g, i.e.
overfit(g), overfit(g − 1), ..., overfit(g − m) > 0
However, the above criterion only considers whether or not the over-fitting phenomenon occurs, without taking its value into account. This leads us to the second class of stopping criteria: stop when the over-fitting value does not decrease in m successive generations.
VF_m: stop when the over-fitting value has not decreased for m successive generations, i.e.
overfit(g) ≥ overfit(g − 1) ≥ ... ≥ overfit(g − m) > 0
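Given the per-generation over_fit values, the two classes of criteria can be checked as sketched below; the convention that the window covers generations g − m through g follows the inequalities as written above, and the caller is assumed to have over_fit values up to generation g.

def of_m_triggered(over_fit, g, m):
    """OF_m: over-fitting occurred (value > 0) in every generation from g - m to g."""
    return g >= m and all(over_fit[i] > 0 for i in range(g - m, g + 1))

def vf_m_triggered(over_fit, g, m):
    """VF_m: the over-fit value has not decreased from generation g - m to g."""
    return (g >= m and over_fit[g - m] > 0 and
            all(over_fit[i] >= over_fit[i - 1] for i in range(g - m + 1, g + 1)))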
In the early generations, the desire for early stopping should be smaller than in the late generations. Therefore, we also propose a variant of the second stopping criterion, called the true adaptive stopping criterion, which works as in Algorithm 2 (where d is the number of generations whose over-fit values are not decreasing, fg is the generation that started the chain of non-decreasing over-fit values, lg is the last generation of that chain, random is a random value in [0, 1], g is the current generation, and MaxGen is the maximum number of generations).
Algorithm 2: True Adaptive Stopping
During a run, individuals are never evaluated on the test set, so the validation error is used as an estimate of the generalization error. The third class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold. The class GL_α is defined as follows.
GL_α: stop after the first generation at which the generalization loss exceeds α.
However, when the training error is still decreasing rapidly, generalization losses have a higher chance to be "repaired"; thus we assume that over-fitting does not begin until the error decreases only slowly. The training progress is measured over a training strip of length k, defined as a sequence of k generations numbered g + 1, ..., g + k where g is divisible by k. It measures how much the average training error during the strip is larger than the minimum training error during that strip.
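For reference, in Prechelt's formulation of early stopping criteria, which the generalization loss and training-strip notions above appear to follow (an assumption of this summary), the two quantities are usually written as
GL(g) = 100 · (E_va(g) / E_opt(g) − 1), where E_opt(g) = min_{g' ≤ g} E_va(g'),
P_k(g) = 1000 · ( Σ_{g' = g+1..g+k} E_tr(g') / (k · min_{g' = g+1..g+k} E_tr(g')) − 1 ),
with E_va and E_tr denoting the validation and training errors; GL_α then stops after the first generation with GL(g) > α.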