MINISTRY OF EDUCATION AND TRAINING    MINISTRY OF NATIONAL DEFENSE
MILITARY TECHNICAL ACADEMY

NGUYEN THI HIEN

Supervisors:
1. Assoc. Prof. Dr. Nguyen Xuan Hoai
2. Prof. Dr. R.I. (Bob) McKay
Genetic Programming (GP) is a meta-heuristic technique that simulates biological evolution. GP differs from other Evolutionary Algorithms (EAs) in that it generates solutions to problems in the form of computer programs. The quality criterion for each individual is often referred to as fitness in EAs, and it is the measure by which we determine which individuals are selected. GP induces a population of computer programs that improve their fitness automatically as they experience the data on which they are trained. Accordingly, GP can be seen as one of the machine learning methods. Machine learning is a process that trains a system over some training data in order to capture the succinct causal relationships among the data. The key parts of this process are the "learning domain", the "training set", the "learning system" and the "testing" (validation) of the learnt results. The quality of a learnt solution is its ability to predict the outputs for future unseen data (often simulated on a test set), which is often referred to as "generalization". The goal of this thesis is to investigate the aspects that affect the generalization ability of GP, as well as techniques from the more traditional machine learning literature that help to improve the generalization and learning efficiency of GP. In particular, a number of early stopping criteria were proposed and tested. The early stopping techniques (with the proposed stopping criteria) helped to avoid over-fitting in GP, producing much simpler solutions in much shorter training time. Over-fitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been over-fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. The thesis also proposes a framework that combines progressive sampling, a traditional machine learning technique, with early stopping for GP. The experimental results showed that progressive sampling in combination with early stopping helps to further improve the learning efficiency of GP (i.e., to avoid over-fitting and produce simpler solutions in much shorter training time).
The first person I would like to thank is my supervisor, for directly guiding me through the PhD process. His enthusiasm has been the power source that motivated me in this work, and his guidance has inspired much of the research in this thesis.

I also wish to thank my co-supervisor. Although staying far away, we usually had on-line research chats, and working with him I have learnt how to do research systematically. The leaders of the Department of Software Engineering and the Faculty of Information Technology, Military Technical Academy, have frequently supported my research with regard to facilities and materials for running the experiments. Dr. Nguyen Quang Uy has discussed a number of issues related to this research with me; I would like to thank him for his thorough comments and suggestions.

I also would like to acknowledge the support from my family, especially my parents, Nguyen Van Hoan and Nguyen Thi Quy, who have worked hard and always believed strongly in their children, and my husband, Nguyen Hai Trieu, for sharing happiness and difficulty in life with me. To my beloved daughter Nguyen Thi Hai Phuong, who was born before this thesis was completed, I would like to express my thanks for being such a good girl, always cheering me up.
Contents

0.1 Motivations 2
0.2 Research Perspectives 3
0.3 Contributions of the Thesis 3
0.4 Organization of the Thesis 4
1 Backgrounds 6
1.1 Genetic Programming 6
1.1.1 Representation, Initialization and Operators in GP 7
1.1.2 Some Variants of GP 17
1.2 An Example of Problem Domain 19
1.3 Machine Learning 19
1.4 GP and Machine Learning Issues 21
1.4.1 Search Space 21
1.4.2 Bloat 22
1.4.3 Generalization and Complexity 22
1.5 Generalization in GP 23
1.5.1 Overview of Studies on GP Generalization Ability 24
1.5.2 Methods for Improving GP Generalization Ability 25
1.5.3 Problem of Improving Training Efficiency 33
1.6 Summary 35
2 Early Stopping, Progressive Sampling, and Layered Learning 36
2.1 Over-training, Early Stopping and Regularization 36
2.1.1 Over-training 37
2.1.2 Early Stopping 37
2.1.3 Regularization Learning 39
2.2 Progressive Sampling 41
2.2.1 Types of Progressive Sampling Schedule 44
2.2.2 Detection of Convergence 47
2.3 Layered Learning 49
2.4 Conclusion 50
3 An Empirical Study of Early Stopping for GP 51
3.1 Method 53
3.1.1 Proposed Classes of Stopping Criteria 53
3.2 Experimental Settings 57
3.3 Results and Discussions 59
3.3.1 Comparison with GP 59
3.3.2 Comparison with Tarpeian 76
3.4 Conclusion 77
4 A Progressive Sampling Framework for GP - An Empirical Approach 79
4.1 The Proposed Method 82
4.1.1 The Sampling Schedule 84
4.1.2 The Initial Sample Set Size 84
4.1.3 The Number of Learning Layers 86
4.1.4 The Stopping Criterion for Each Layer 86
4.2 Experimental Settings 87
4.2.1 Data Preparation 87
4.2.2 GP Systems Configurations 87
4.3 Results and Discussions 89
4.3.1 Learning Effectiveness Comparison 89
4.3.2 Learning Efficiency 91
4.3.3 Solution Complexity 92
4.3.4 Distribution of Best Solutions by Layers 93
4.3.5 Synergies between System Components 94
4.4 Conclusion 98
5 Conclusions and Future Work 99
5.1 Contributions 100
5.2 Future Directions 101
List of Figures
1.1 GP's main loop [57] 7
1.2 GP expression tree representing max(x*x,x+3y) [57] 8
1.3 Creation of a full tree having maximum depth 2 (and therefore a total of seven nodes) using the Full initialisation method (t=time) [57] 10
1.4 Creation of a five node tree using the Grow initialisation method with a maximum depth of 2 (t=time). A terminal is chosen at t = 2, causing the left branch of the root to be terminated at that point even though the maximum depth has not been reached [57] 12
1.5 Example of subtree crossover 14
1.6 Example of subtree mutation [57] 16
2.1 Idealized training and validation error curves. Vertical axis: errors; horizontal axis: time [88] 38
2.2 A real validation error curve. Vertical: validation set error; horizontal: time (in training epochs) [88] 38
2.3 Learning curves and progressive sample 44
2.4 Regions of the Efficiency of Arithmetic and Geometric Schedule Sampling [89] 46
3.1 Distribution of stop time (last generation of the run) of the OF criterion and GPM 74
3.2 Distribution of stop time (last generation of the run) of the VF criterion and GPM 74
3.3 Distribution of stop time (last generation of the run) of the GL criterion and GPM 75
3.4 Distribution of stop time (last generation of the run) of the PQ criterion and GPM 75
3.5 Distribution of stop time (last generation of the run) of the UP criterion and GPM 75
4.1 PS framework for GP 84
4.2 GP Evolution of each layer 87
List of Tables
1.1 Test Functions 20
1.2 The real-world data sets 20
3.1 Parameter settings for the GP systems 58
3.2 Data Sets for Problems. For the synthetic problems, the notation [min, max] defines the range from which the data points are sampled 59
3.3 Generalization error and run time of VF, OF, True Adap stopping criterion and GP, Tar 61
3.4 Size of best individual of VF, OF and True Adap stopping criterion and GP, Tar 62
3.5 p-value of run time and GE of VF, OF, True Adap stopping criterion and Tar vs GP 63
3.6 p-value of complexity of VF, OF, True Adap stopping criterion and Tar vs GP 64
3.7 Relative Generalization Errors and Run Time of OF, VF and True Adap stopping criterion (vs GP) 65
3.8 Generalization error and run time of GL, PQ and UP stopping criterion 66
3.9 Size of the best individual of GL, PQ and UP stopping criterion 67
3.10 p-value of GL, PQ and UP stopping criterion vs GP 68
3.11 p-value of size of best individual: comparison of GL, PQ and UP stopping criterion with GP 69
3.12 Relative Generalization Errors and Run Time of GL, PQ and UP stopping criterion (vs GP) 70
3.13 p-value of GE, run time and size of best solution of True Adap, PQ vs Tarpeian 77
4.1 Data Sets for the Test Functions 88
4.2 Evolutionary Parameter Settings for the Genetic Programming Systems 89
4.3 Mean and SD of Generalization Error, GPLL and GP (14 Problems) 90
4.4 Relative Average Generalization Errors and p-values (GPLL vs GP) (14 Problems) 90
4.5 Mean Run Time, GPLL and GP (14 Problems) 91
4.6 Relative Run Times and p-values (GPLLs vs GP) (14 Problems) 92
4.7 Mean Solution Size and p-values, GPLL vs GP (14 Problems) 93
4.8 Number of End-of-run Best Solutions First Found by GPLL in Each Layer 94
4.9 Average Generalization Error on 14 Problems 95
4.10 Average Run Time of All Systems 96
4.11 Average Size of Solutions Found by all Systems 97
Introduction

The Genetic Programming (GP) paradigm, first proposed by Koza [53], can be viewed as the discovery of computer programs which produce desired outputs for particular inputs. A computer program could be an expression, a formula, a plan, a control strategy, a decision tree or a model, depending on the type of problem. GP is a simple and powerful technique which has been applied to a wide range of problems in combinatorial optimization, automatic programming and model induction. The direct encoding of solutions as variable-length computer programs allows GP to provide solutions that can be evaluated and also examined to understand their internal workings. In the field of evolutionary and natural computation, GP plays a special role. In Koza's seminal book, the ambition of GP was to evolve a population of programs that can learn automatically from the training data. Since its birth, GP has developed quickly, with various fruitful applications. Some of the developments concern novel mechanisms in the GP evolutionary process, while others concern finding potential areas to which GP could be successfully applied. Despite these advances in GP research, when GP is applied to learning tasks, the issue of the generalization of GP, as a learning machine, has not received the attention that it deserves. This thesis focuses on the generalization aspect of GP and proposes mechanisms to improve GP generalization ability.
0.1 MOTIVATIONS
GP is one of the evolutionary algorithms (EAs). The common underlying idea behind all EA techniques is the same: given a population of individuals, environmental pressure causes natural selection (survival of the fittest), and this causes a rise in the fitness of the population. Given a quality function to be maximised, we can randomly create a set of candidate solutions, i.e., elements of the function's domain, and apply the quality function as an abstract fitness measure - the higher the better. Based on this fitness, some of the better candidates are chosen to seed the next generation by applying recombination and/or mutation to them. Recombination is an operator applied to two or more selected candidates (the so-called parents) and results in one or more new candidates (the children). Mutation is applied to one candidate and results in one new candidate. Executing recombination and mutation leads to a set of new candidates (the offspring) that compete - based on their fitness (and possibly age) - with the old ones for a place in the next generation. This process can be iterated until a candidate with sufficient quality (a solution) is found or a previously set computational limit is reached. GP has been proposed as a machine learning (ML) method [53], and solutions to various learning problems are sought by means of an evolutionary process of program discovery. Since GP often tackles ML issues, generalization, which is synonymous with robustness, is the desirable property at the end of the GP evolutionary process. The main goal when using GP or other ML techniques is not only to create a program that exactly covers the training examples (as in many GP applications so far), but also to obtain good generalization ability: for unseen and future samples of data, the program should output values that closely resemble those of the underlying function. Although the issue of generalization in GP has recently received more attention, the use of more traditional machine learning techniques for this purpose has been rather limited, even though generalization has long been a studied topic in machine learning.
The current theoretical models of GP are limited in use and applicability due to their complexity. Therefore, the majority of theoretical work has been derived from experimentation. The approach taken in this thesis is also based on carefully designed experiments and the analysis of the experimental results.
0.3 CONTRIBUTIONS OF THE THESIS

This thesis makes the following contributions:
1. An early stopping approach for GP that is based on an estimate of the generalization error, together with some new criteria proposed for early stopping in GP. The GP training process is stopped early at the point in time which promises to yield the solution with the smallest generalization error.
2. A progressive sampling framework for GP, which divides the GP learning process into several layers, with the training set size starting small and gradually getting bigger at each layer. The termination of training on each layer is based on certain early stopping criteria.

0.4 ORGANIZATION OF THE THESIS
The organization of the rest of the thesis is as follows:
Chapter 1 first introduces the basic components of GP - the algorithm, representation, and operators - as well as some benchmark problem domains (some of which are used for the experiments in this thesis). Two important research issues and a metaphor of GP search are then discussed. Next, the major issues of GP are discussed, especially when GP is considered as a machine learning system. The chapter overviews the approaches proposed in the literature that help to improve the generalization ability of GP. Then, solution complexity (code bloat), in the particular context of GP, and its relations to GP learning performance (generalization, training time, solution comprehensibility) are discussed.
Chapter 2 provides background on a number of concepts and techniques subsequently used in the thesis for improving GP learning performance. They include progressive sampling, layered learning, and early stopping in machine learning.

Chapter 3 presents one of the main contributions of the thesis. It proposes some criteria used to determine when to stop the training of GP, with the aim of avoiding over-fitting. Experiments are conducted to assess the effectiveness of the proposed early stopping criteria. The results emphasise that GP with early stopping can find solutions with at least similar generalization errors to standard GP, but in much shorter training time and with lower solution complexity.
Chapter 4 develops a learning framework for GP that is based on layered learning and progressive sampling. The results show that the solutions of GP with progressive sampling are significantly reduced in size and the training time is much shorter when compared to standard GP, while the generalization error of the obtained solutions is often smaller.
Chapter 5 concludes the thesis, summarizing the main results and proposing suggestions for future work that extends the research in this thesis.
Chapter 1
Backgrounds
GP [53] is an evolutionary algorithm (EA) that uses either a procedural or a functional representation. This chapter describes the representation and the specific algorithm components used in the canonical version of GP. The basics of GP are initially presented, followed by a discussion of the algorithm and a description of some variants of GP. Then, a survey of ML-related issues of GP is given. The chapter ends with a comprehensive overview of the literature on the approaches used to improve GP generalization ability.
1.1 Genetic Programming

GP is an evolutionary paradigm that is inspired by biological evolution to find solutions, in the form of computer programs, to a problem. An early form of GP used a tree representation of programs in conjunction with subtree crossover to evolve a multiplication function. The evolution through each generation is described in Figure 1.1. GP adopts a similar search strategy to a genetic algorithm (GA), but uses a program representation and special operators. It is this representation that makes GP unique. Algorithm 1 shows the basic steps to run GP.
Fig 1.1: GP’s main loop [57]
Algorithm 1: Abstract GP Algorithm [57].
Randomly create an initial population of programs from the available primitives (see Section 1.1.1.2).
repeat
    Execute each program and ascertain its fitness.
    Select one or two program(s) from the population with a probability based on fitness to participate in genetic operations (see Section 1.1.1.3).
    Create new individual program(s) by applying genetic operations with specified probabilities (see Section 1.1.1.4).
until an acceptable solution is found or some other stopping condition is met (e.g., reaching a maximum number of generations).
return the best-so-far individual.
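For illustration only, the shape of this loop can be sketched in C++ as follows. The sketch is purely structural: the individual representation and the genetic operators are stubbed out, and the names (Individual, pop_size, the random "fitness" values) are placeholders for this example, not parts of the GP system described in this thesis.

#include <algorithm>
#include <random>
#include <vector>

// Structural sketch of Algorithm 1: evaluate -> select -> vary -> replace.
struct Individual {
    // A real GP individual would hold an expression tree (see Section 1.1.1.1).
    double fitness = 1e9;   // error to be minimised
};

int main() {
    std::mt19937 rng(42);
    const int pop_size = 100, max_generations = 50;
    const double acceptable_error = 1e-3;
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, pop_size - 1);

    // Randomly create an initial population of programs.
    std::vector<Individual> pop(pop_size);

    for (int gen = 0; gen < max_generations; ++gen) {
        // Execute each program and ascertain its fitness (stub: random error values).
        for (auto& ind : pop) ind.fitness = unit(rng);

        // Stop if an acceptable solution has been found.
        double best = std::min_element(pop.begin(), pop.end(),
            [](const Individual& a, const Individual& b) { return a.fitness < b.fitness; })->fitness;
        if (best <= acceptable_error) break;

        // Select parents with probability based on fitness (size-2 tournaments here)
        // and create new individuals by applying genetic operations (stubbed out).
        std::vector<Individual> next;
        while (static_cast<int>(next.size()) < pop_size) {
            const Individual& a = pop[pick(rng)];
            const Individual& b = pop[pick(rng)];
            Individual child = (a.fitness < b.fitness) ? a : b;  // tournament winner
            // crossover and/or mutation of the winner's tree would be applied here
            next.push_back(child);
        }
        pop = std::move(next);
    }
    return 0;  // the best-so-far individual would be reported here
}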
1.1.1 Representation, Initialization and Operators in GP
This section will detail the basic steps and terminology in the process of running GP. In particular, it will define how the solution is represented in most GP systems (subsection 1.1.1.1), how to initialize the first population (subsection 1.1.1.2), and how selection (subsection 1.1.1.3) as well as recombination and mutation (subsection 1.1.1.4) are used to create the new individuals in standard GP (as proposed in [53]).
1.1.1.1 Representation
Traditionally, GP programs are expressed as expression trees. Figure 1.2 shows, for example, the tree representation of the program max(x*x, x+3*y). The variables and constants in the program (x, y, and 3) are called terminals in GP; they are the leaves of the tree. The arithmetic operations (+, *, and max) are internal nodes (called functions in the GP literature). The sets of allowed functions and terminals together form the primitive symbol set of a GP system.
Fig 1.2: GP expression tree representing max(x*x,x+3y) [57]
The terminal set. The terminal set consists of all inputs to GP programs (individuals), constants supplied to GP programs, and the zero-argument functions with side-effects executed by GP programs. Inputs, constants and other zero-argument nodes are called terminals or leaves, as they appear at the frontier of expression-tree-based programs. In this view, GP is no different from any other ML system: users have to make decisions about the set of features (inputs), and each of these inputs becomes part of the GP training and test sets as a GP terminal.
The function set. The function set comprises the statements, operators, and functions available to GP systems. These may include:

• Mathematical Functions
For example: trigonometric and logarithmic functions.
• Variable Assignment Functions
• Indexed Memory Functions
• Conditional Statements
For example: IF, THEN, ELSE; CASE or SWITCH statements
• Control Transfer Statements
For example: GO TO, CALL, JUMP
Languages that provide automatic garbage collection and dynamic lists (which ease the implementation of expression trees and of the necessary GP operations), together with scientific programming tools (e.g., MATLAB and Mathematica), have been widely used to implement GP systems. In this thesis, GP is implemented in C++.
1.1.1.2 Initializing the Population
In GP, the individuals in the initial population are usually randomly generated. There are many possible methods for tree initialisation; two common ones are the Full and Grow methods. In the Full generation method, only functions are selected as arguments to other functions until a certain depth is reached, whereupon only terminals are selected. This method ensures that all branches of the tree will be of the same depth. The Grow method allows both functions and terminals to be selected, with program branches allowed to be any depth up to some given maximum depth; when this depth is reached, only terminals are selected. This method allows trees with branches of different depths to be generated, but ensures that the trees will never become larger than a certain depth. Pseudo code for a recursive implementation of both the Full and Grow methods is given in Algorithm 2.
Fig 1.3: Creation of a full tree having maximum depth 2 (and therefore a total of seven
nodes) using the Full initialisation method (t=time) [57]
Algorithm 2: Pseudo code for recursive generation of trees with the "Full" and "Grow" methods [57].

Procedure: gen_rnd_expr(func_set, term_set, max_d, method)
if max_d = 0 or (method = grow and rand() < |term_set| / (|term_set| + |func_set|)) then
    expr = choose_random_element(term_set)
else
    func = choose_random_element(func_set)
    for i ← 1 to arity(func) do
        arg_i = gen_rnd_expr(func_set, term_set, max_d − 1, method)
    expr = (func, arg_1, arg_2, ...)
return expr

Notes: func_set is the function set, term_set is the terminal set, max_d is the maximum allowed depth for expression trees, method is either "Full" or "Grow", and expr is the generated expression in prefix notation.
The Full and Grow tree generation methods can be combined in the commonly used Ramped half-and-half method. Here, half the initial population is constructed using the Full method and the other half is constructed using the Grow method. This is done using a range of depth limits (hence the term "ramped") to ensure that the generated expression trees have a variety of sizes and shapes. While these methods are easy to implement and use, they often make it difficult to control the statistical distributions of important properties such as the sizes and shapes of the generated trees. Other initialisation mechanisms have been developed (e.g., [62]) to allow better control of these properties in instances where such control is important.
Fig 1.4: Creation of a five node tree using the Grow initialisation method with a maximum
depth of 2 (t=time) A terminal is chosen at t = 2, causing the left branch of the root to
be terminated at that point even though the maximum depth has not been reached [57]
On the other hand, the initial population does not need to be completely random. If something is known about the likely properties of the desired solution(s), expression trees that have these characteristics can be used as seeds to generate the initial population. These seeding individuals (programs) can be created by humans based on knowledge of the problem domain, or they can be the results of previous GP runs. However, unwanted situations may occur when the second generation comprises mostly the offspring of a single, or a very small number of, seeding individuals. Diversity protection techniques, such as multi-objective GP (e.g., [61]), demes [3], fitness sharing [24] and the use of multiple seeding trees, might be used. In any case, the diversity of the population should be monitored to ensure that there is a diverse set of trees in the initial population.
1.1.1.3 Fitness and Selection
Like in most other EAs, genetic operators in GP are applied to individuals that are probabilistically selected based on their fitness. Fitness is the measure used by GP to indicate how well a program has learned to predict the output(s) from the input(s) [6]. The goal of having a fitness evaluation is to give feedback to the learning algorithm; for this purpose, the fitness function should be designed to give graded and continuous feedback about how well a GP program (individual) performs on the training set. A common error-based fitness is computed from the differences between the program outputs and the desired outputs, where o_i is the desired output for the i-th example in the training set, p_i is the output of GP program p on the i-th example, and n is the number of training examples.
Squared Error Fitness Function. An alternative fitness function is the sum of the squared differences between p_i and o_i, called the squared error.
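Using the symbols defined above, these two error-based fitness measures can be written in their standard forms as follows (the absolute-error form is the one commonly used as the default fitness; the notation of the original equations is assumed here):

f_abs(p) = Σ_{i=1}^{n} | p_i − o_i |        (sum of absolute errors)

f_sq(p) = Σ_{i=1}^{n} ( p_i − o_i )²        (sum of squared errors)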
1.1.1.4 Recombination
Where GP departs significantly from other EAs is in the implementation of its operators - crossover and mutation. The most commonly used form of crossover in GP is subtree crossover. It is conducted in the following steps:

• Choose two individuals as parents, based on the reproduction selection strategy.
• Select a random subtree in each parent. Subtrees that consist of a single terminal are selected with a lower probability than other subtrees.

• Swap the selected subtrees between the two parents. The resulting individuals are the children.
The crossover operation is illustrated in Figure 1.5.
Fig 1.5: Example of subtree crossover.
The majority of the nodes of a GP tree will be leaves. Consequently, the uniform selection of crossover points leads to crossover operations that frequently exchange only very small amounts of genetic material (i.e., small subtrees); many crossovers may in fact reduce to simply swapping two leaves. Therefore, crossover points are usually not selected with uniform probability: Koza suggested the widely used approach of choosing functions 90% of the time, while leaves are selected only 10% of the time. While subtree crossover is commonly used in GP, a number of variants of this operator have been defined in the literature. For example, one-point crossover [86, 4] works by selecting a common crossover point in the parent programs and then swapping the corresponding subtrees. In context-preserving crossover [21], the crossover points are constrained to have the same coordinates, as in one-point crossover. Size-fair crossover and homologous crossover were described in [56]. In size-fair crossover, the first crossover point is selected randomly, as in standard subtree crossover. Then the size of the subtree to be excised from the first parent is calculated. This is used to constrain the choice of the second crossover point so as to guarantee that the subtree excised from the second parent will not be "unfairly" big. The homologous crossover operator works identically to the size-fair crossover operator up to the last step: instead of randomly choosing between all the available subtrees of the desired size in the second parent, it deterministically chooses the one closest to the subtree in the first parent.
1.1.1.5 Mutation
The mutation operator is used to make a random change to an existing program tree. In the standard subtree mutation used in this thesis, an individual is selected for mutation using fitness-proportional selection. A single function or terminal is selected at random, from among the set of functions and terminals making up the original individual, as the point of mutation. The mutation point, along with the subtree stemming from it, is then removed from the tree and replaced with a new, randomly generated subtree. The new subtree inserted into the tree to form the new individual is created using the "grow" method of tree construction, as used to generate an entire tree as described above. Figure 1.6 depicts the process. Subtree mutation is sometimes implemented as crossover between a program and a newly generated random program; this operation is also known as "headless chicken" crossover [1]. Another alternative form of mutation is point mutation, which is the rough GP equivalent of the bit-flip mutation used in GAs. In point mutation, a random node is selected and the primitive stored there is replaced with a different random primitive of the same arity, taken from the primitive set (functions and terminals). If no other primitive with that arity exists, nothing happens to that node (but other nodes may still be mutated). Note that when subtree mutation is applied, it involves the modification of exactly one subtree. Point mutation, on the other hand, is typically applied with a given mutation rate on a per-node basis, allowing multiple nodes to be mutated independently.
Fig 1.6: Example of subtree mutation [57]
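The per-node point mutation just described can be sketched as follows; the tree type and the primitive sets grouped by arity are again illustrative assumptions, not the operators of the thesis system.

#include <memory>
#include <random>
#include <string>
#include <vector>

struct Node {
    std::string symbol;
    std::vector<std::unique_ptr<Node>> children;      // size == arity of the primitive
};

// Illustrative primitives grouped by arity; a real system derives these from its primitive set.
const std::vector<std::string> kArity0 = {"x", "y", "1"};
const std::vector<std::string> kArity2 = {"+", "-", "*", "max"};

// Point mutation: visit every node and, with probability rate, replace its primitive with a
// different primitive of the same arity; if no other primitive of that arity exists, do nothing.
void point_mutation(Node& node, double rate, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < rate) {
        const auto& all = node.children.empty() ? kArity0 : kArity2;
        std::vector<std::string> pool;
        for (const auto& s : all)
            if (s != node.symbol) pool.push_back(s);   // candidates of the same arity
        if (!pool.empty()) {
            std::uniform_int_distribution<size_t> pick(0, pool.size() - 1);
            node.symbol = pool[pick(rng)];             // keep the structure, change only the label
        }
    }
    for (auto& child : node.children) point_mutation(*child, rate, rng);
}

int main() {
    std::mt19937 rng(3);
    auto leaf = [](const char* s) { auto n = std::make_unique<Node>(); n->symbol = s; return n; };
    Node root;
    root.symbol = "+";
    root.children.push_back(leaf("x"));
    root.children.push_back(leaf("y"));
    point_mutation(root, 0.1, rng);                    // each node mutated independently
    return 0;
}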
There are a number of mutation operators which treat constants in GP programs as special cases. For example, [95] mutates constants by adding Gaussian noise to them. Others use a variety of potentially expensive optimization tools to try to fine-tune an existing program by finding the "best" values for the constants within it. E.g., [96] uses "a numerical partial gradient ascent to reach the nearest local optimum" to modify all constants in a program, while [98] uses simulated annealing to stochastically update numerical values within individuals.
While mutation is not necessary for GP to solve many problems, [79] argues that, in some cases, GP with mutation alone can perform as well as GP using crossover. While mutation was rarely used in early GP work, it is more widely used in GP today, especially in modeling applications.
1.1.1.6 Reproduction
The reproduction operator copies an individual from one population into the next.
1.1.1.7 Stopping Criterion
Traditionally, GP uses a generational approach: each generation is represented by a complete population of individuals, and the newer population is created from, and then replaces, the older population. Alternatively, a steady-state approach can be used, in which there is a continuous flow of individuals being selected, mated, and producing offspring; the offspring replace existing individuals in the same population. A maximum number of generations usually defines the stopping criterion in GP. However, when it is possible to achieve an ideal individual (with ideal fitness), this can also stop the GP evolutionary process. In this thesis, some other criteria are proposed, such as a measure of generalization loss, a lack of fitness improvement, or a run of generations with over-fitting.
1.1.2 Some Variants of GP

In linear GP crossover, sequences of words containing a whole number of instructions are typically swapped over. Similarly, mutation operations normally respect word boundaries and generate legal machine code.
However, linear GP lends itself to a variety of other genetic operators. If the goal is execution speed, then the evolved code should be machine code for a real computer rather than some higher-level language or virtual-machine code. Peter Nordin started this trend by evolving machine code for SUN computers [74]. Ron Crepeau [14] targeted the Z80. Kwong Sak Leung's linear GP [58] was firmly targeted at novel hardware, but much of the GP development had to be run in simulation whilst the hardware itself was under development.
In its simplest form, Parallel Distributed GP (PDGP) is a generalization of standard GP. However, more complex representations can be used, which allow the exploration of a large space of possible programs, including standard tree-like programs, logic networks, neural networks, recurrent transition networks and finite state automata. In PDGP, programs are manipulated by special crossover and mutation operators which guarantee the syntactic correctness of the offspring; for this reason, PDGP search is very efficient. PDGP programs can be executed in different ways, depending on whether nodes with side effects are used or not. In a system called PADO (Parallel Algorithm Discovery and Orchestration), Teller and Veloso [107] used a combination of GP and linear discrimination to obtain parallel classification programs for signals and images. The programs in PADO are represented as graphs, although their semantics and execution strategy are very different from those of PDGP.
1.1.2.3 Grammar Based GP
In Grammar Based GP, each program is a derivation tree generated by a grammar; the grammar, often context-free, defines a language whose terms are expressions corresponding to those used in standard tree-based GP. Grammars bring a number of benefits to GP, one of the most important being that they help to restrict the search space with prior knowledge from users. This restriction is done via the crafting of grammar production rules. Grammar Based GP has found a large and varied range of real-world applications, for example in ecological modeling, medicine, finance, and design. For more details on the history and developments of this subfield, readers are referred to [68] and the references therein.
1.2 An Example of Problem Domain

The domain of symbolic regression is one of the most common problem domains in GP. In symbolic regression, input and output pairs are used to infer a functional model in symbolic form: GP is asked to find a program that approximates a target (unknown) function. A target function f(x) = y is applied to domain values, x, in a pre-determined range. The resulting values are then compared with the candidate program's values on the same domain values. This thesis uses the 10 polynomials in Table 1.1, with white (Gaussian) noise of standard deviation 0.01; that is, each target function is F(x) = f(x) + ε. Four real-world data sets from the UCI machine learning repository [7] and StatLib [116] are also used, as shown in Table 1.2.
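For illustration, a noisy training sample for one synthetic target of this kind could be generated as in the following sketch; the particular polynomial, sampling range and sample size are placeholders rather than the exact settings of Table 1.1.

#include <random>
#include <utility>
#include <vector>

// Draw n training pairs (x, F(x)) with F(x) = f(x) + eps, where eps ~ N(0, 0.01^2).
std::vector<std::pair<double, double>> make_training_set(int n, double lo, double hi, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> sample_x(lo, hi);
    std::normal_distribution<double> noise(0.0, 0.01);
    auto f = [](double x) { return x * x * x + x * x + x; };   // placeholder polynomial target
    std::vector<std::pair<double, double>> data;
    for (int i = 0; i < n; ++i) {
        double x = sample_x(rng);
        data.emplace_back(x, f(x) + noise(rng));
    }
    return data;
}

int main() {
    auto train = make_training_set(50, -1.0, 1.0, 11);
    return train.size() == 50 ? 0 : 1;
}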
Tab 1.1: Test Functions

Tab 1.2: The real-world data sets
Data sets | Features | Size | Source
Concrete Compressive Strength | 9 | 1030 | UCI

1.3 Machine Learning

Machine learning is programming computers to optimize a performance criterion using example data or past experience [69]. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. First, in training, we need efficient algorithms to solve the optimization problem, as well as to store and process the massive amount of data we generally have. Second, once a model is learned, its representation and algorithmic solution for inference need to be efficient as well. In certain applications, the efficiency of the learning or inference algorithm, namely its space and time complexity, may be as important as its predictive accuracy. Therefore, there are several possible criteria for evaluating a learning algorithm's performance: predictive accuracy of the classifier, speed of the learner, speed of the classifier, and space requirements. Among these, predictive accuracy is the most common criterion. Generalization is an important issue in ML. The generalization ability of a learning machine depends on a balance between the information in the training examples and the complexity of the model. Bad generalization occurs if:
• Over-fitting of data: the information does not match the complexity - the training
data set is too small, or the complexity of the models is too large
• Under-fitting of data: the opposite situation.
1.4 GP and Machine Learning Issues

GP is one of only two ML techniques explicitly able to represent and learn relational (or first-order) knowledge, the other being Inductive Logic Programming (ILP) and its extended variants. GP has been successfully applied to a wide range of ML problems. In spite of its success, GP remains a topic of open research and development, with many aspects still theoretically unexplained and many big questions awaiting future answers. There are complex feasibility constraints (in most systems, not all genotypes are legal) and complex fitness dependencies (epistases). This section explores some open questions in GP research from the ML perspective, which motivate the research in this thesis.
1.4.1 Search Space
GP is a search technique that explores the space of computer programs. The search for solutions to a problem starts from a group of points (random programs) in this search space. Those points that are of above-average quality are then used to generate a new generation of points through crossover, mutation, reproduction and possibly other genetic operations. This process is repeated over and over again until a stopping criterion is satisfied. We shall show how the space of all possible programs, i.e. the space which GP searches, scales - particularly how it changes with respect to the size of programs. We will show, in general, that above some problem-dependent threshold, considering all programs, their fitness shows little variation with their size. The distribution of fitness levels, particularly the distribution of solutions, gives us directly the performance of random search. In implementations of GP, the maximum depth of a tree is not the only parameter that limits the search space; an alternative could be the maximum number of nodes of an individual, or both (depth and size).
1.4.2 Bloat
In the course of evolution, the average size of the individuals in the GP population often increases greatly; at some point it may even start growing at a rapid pace. Typically, the increase in program size is not accompanied by any corresponding increase in fitness. The origin of this phenomenon, which is known as bloat [5, 63], has effectively been a subject of research for over a decade. In fact, an increase in the size of individuals over the generations may be perfectly reasonable if it is needed to comply with all the fitness cases, given the random initialization of the GP population; so bloat is not the same as growth. We should only talk of bloat when there is growth without (significant) return in terms of fitness. Because of its practical effects (large programs are hard to interpret, may have poor generalization, and are computationally expensive to evolve and later use), bloat has been a subject of intense study in GP [22, 87, 71, 99, 106, 67]. As a result, many theories have been proposed to explain bloat: the replication accuracy theory, the removal bias theory, the nature of program search spaces theory, etc. There is a great deal of confusion in the field as to the reasons for (and the remedies for) bloat.
1.4.3 Generalization and Complexity
Generalization is just as important for genetic-based methods as it is for conventional learning methods. The power of learning lies in the fact that learning systems can extract implicit relationships that may exist in the input-output mapping or in the interaction of the learner with the environment. Learning such an implicit relationship only during the training phase may not be sufficient: a generalization of the learned tasks or behaviors may be essential for several reasons. The earlier work in GP focused on solving problems on the training data set, without considering that over-fitting on the training set will make the solution worse at fitting future data. Therefore, the generalization ability of GP has been paid more attention in recent works [13, 33, 111]. Most of the research on improving the generalization ability of GP has concentrated on avoiding over-fitting on the training data. One of the main causes of over-fitting has been identified as the "complexity" of the hypothesis generated by the learning algorithm. When GP trees grow to fit, or to specialize on, "difficult" learning cases, they will no longer have the ability to generalize further; this is entirely consistent with the principle of Occam's razor [52], which states that simple solutions should always be preferred. The relationship between generalization and individual complexity in GP has often been studied in the context of code bloat, which is an extraordinary enlargement of the complexity of solutions without an increase in their fitness.
For the aforementioned reasons, the following section will synthesize the techniques used to improve the generalization ability of GP, before going on to introduce promising techniques that can increase this ability.
1.5 Generalization in GP

GP has been proposed as an ML method, and the core objective of a learner is to generalize from its experience. The training examples come from some generally unknown probability distribution, and the learner has to extract from them something more general, something about that distribution, that allows it to produce useful answers.

1.5.1 Overview of Studies on GP Generalization Ability

In [54], generalization is defined as follows.

Definition 1. Given a learner and its performance after it is "trained" on a subset of the instance space X, if it exhibits a satisfactory performance on another set of instances belonging to X that does not contain instances used in the training phase, the learner is said to have generalized over this new set of instances.
Although achieving high generalization ability is the main objective of any learning machine, it was not seriously considered in the field of GP for a long time. Before Kushchu published his seminal paper on the generalization ability of GP [54], there were rather few works in the literature dealing with the GP generalization aspect. Common to most of the research above are the problems with obtaining generalization in GP and the attempts to overcome these problems. These approaches can be categorized as follows:

• Using separate training and testing sets in order to promote generalization in supervised learning problems.

• Changing the training instances from one generation to another, evaluating performance based on subsets of the training instances, or combining GP with ensemble learning techniques.
• Changing the formal implementation of GP: the GP representation, the genetic operators and the selection strategy.
1.5.2 Methods for Improving GP Generalization Ability
This section briefly overviews some main approaches in the literature for improving the generalization ability of GP. They are organized into the categories mentioned in the previous subsection.
1.5.2.1 Sampling Based Approaches
Sampling-based approaches are data-driven methods for improving GP learning.
In [35], the author proposed a simple sampling strategy for the training of GP. It forces GP's attention onto the hard fitness cases (samples), i.e. the ones which are frequently mis-classified, and also involves fitness cases which have not been worked on for several generations. In each generation, a subset of training samples is selected by the following two passes through the full training set. In the first pass over the entire training set, of size T, at generation g, each training case i is assigned a weight W_i, which is the sum of its current "learning hardness" D_i, exponentiated to a certain power d, and the number of generations since it was last selected (its age) A_i, also exponentiated to a certain power. In the second pass, each case i is selected for the subset with a probability determined by its weight and the target subset size S:

∀i : 1 ≤ i ≤ T,    P_i(g) = W_i(g) · S / Σ_{j=1}^{T} W_j(g)

This scaled probability ensures that the average selected subset size equals the target subset size S. If a case i is selected to be in the subset, then its hardness D_i and age A_i are set to 0; otherwise its hardness remains unchanged and its age A_i is incremented. While testing each member of the GP population against each fitness case in the current training subset, the hardness D_i (starting from 0) is incremented each time the fitness case is mis-classified by one of the GP trees. The summary results showed that GP with this scheme of dynamic subset selection (DSS) produced results as good as those of standard GP, in a much shorter time.
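The weighting and scaled selection probability of DSS can be sketched as follows; the exponents d and a, the target subset size S, and the function names are illustrative assumptions made for this example, not values or code from [35] or from this thesis.

#include <cmath>
#include <random>
#include <vector>

// DSS weights: hardness^d + age^a for each training case.
std::vector<double> dss_weights(const std::vector<double>& hardness, const std::vector<double>& age,
                                double d, double a) {
    std::vector<double> w(hardness.size());
    for (size_t i = 0; i < w.size(); ++i) w[i] = std::pow(hardness[i], d) + std::pow(age[i], a);
    return w;
}

// Select each case i independently with probability P_i = W_i * S / sum_j W_j,
// so that the expected subset size equals the target size S.
std::vector<int> dss_select(const std::vector<double>& w, double S, std::mt19937& rng) {
    double total = 0.0;
    for (double wi : w) total += wi;
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<int> chosen;
    for (size_t i = 0; i < w.size(); ++i)
        if (coin(rng) < w[i] * S / total) chosen.push_back(static_cast<int>(i));
    return chosen;
}

int main() {
    std::mt19937 rng(5);
    std::vector<double> hardness = {0, 3, 1, 0, 5};   // mis-classification counts so far
    std::vector<double> age      = {2, 0, 1, 4, 0};   // generations since last selected
    auto w = dss_weights(hardness, age, /*d=*/1.0, /*a=*/1.0);
    auto subset = dss_select(w, /*S=*/3.0, rng);
    return subset.empty() ? 1 : 0;
}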
A good example of experimental attempts to increase the robustness of the programs evolved by GP is presented in the work of Moore and Garcia [70]. The system proposed in the paper evolves optimized manoeuvres for the pursuer/evader problem. The results of this study suggest that the use of a fixed set of training cases may result in brittle solutions, due to the fact that such fixed fitness cases may not be representative of the possible situations that the agent may encounter. It was shown that the use of randomly generated fitness cases at the end of each generation can reduce the brittleness of the solutions when they are tested against a large set of representative situations after evolution.
Byoung-Tak Zhang [122] introduced incremental data inheritance (IDI for short), which evolves programs using program-specific subsets of the given data that also evolve incrementally as the generations go on. The basic idea of the IDI method in GP is that programs and their data are evolved at the same time: each program is learned on a separate data set. Thus, IDI maintains two populations: a program population and a data population. The program population A(g) consists of individuals A_i(g) and the data population D(g) consists of data individuals D_i(g), where g is the generation. This method presented a Bayesian framework