MINISTRY OF EDUCATION AND TRAINING    MINISTRY OF NATIONAL DEFENSE
MILITARY TECHNICAL ACADEMY

NGUYEN THI HIEN

Supervisors:
1. Assoc. Prof. Dr. Nguyen Xuan Hoai
2. Prof. Dr. R.I. (Bob) McKay
Genetic Programming (GP) is a meta-heuristic technique that simulates biological evolution. GP differs from other Evolutionary Algorithms (EAs) in that it generates solutions to problems in the form of computer programs. The quality criterion for each individual is often referred to as fitness in EAs, and it is the measure by which we determine which individuals are selected. GP induces a population of computer programs that improve their fitness automatically as they experience the data on which they are trained. Accordingly, GP can be seen as one of the machine learning methods. Machine learning is a process that trains a system over some training data in order to capture the succinct causal relationships among the data. The key parts of this process are the "learning domain", the "training set", the "learning system" and the "testing" (validation) of the learnt results. The quality of a learnt solution is its ability to predict the outputs for future unseen data (often simulated on a test set), which is often referred to as "generalization". The goal of this thesis is to investigate the aspects that affect the generalization ability of GP, as well as techniques from the more traditional machine learning literature that help to improve the generalization and learning efficiency of GP. In particular, a number of early stopping criteria were proposed and tested. The early stopping techniques (with the proposed stopping criteria) helped to avoid over-fitting in GP, producing much simpler solutions in much shorter training time. Over-fitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been over-fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. The thesis also proposes a framework that combines progressive sampling, a traditional machine learning technique, with early stopping for GP. The experimental results showed that progressive sampling in combination with early stopping helps to further improve the learning efficiency of GP (i.e., to avoid over-fitting and produce simpler solutions in much shorter training time).
The first person I would like to thank is my supervisor, for directly guiding me through the PhD process. His enthusiasm has been the power source that motivated me in this work, and his guidance has inspired much of the research in this thesis.

I also wish to thank my co-supervisor. Although staying far away, we usually had on-line research chats, and working with him I have learnt how to do research systematically. The leaders of the Department of Software Engineering and the Faculty of Information Technology, Military Technical Academy, have frequently supported my research with regard to facilities and materials for running the experiments. Dr. Nguyen Quang Uy has discussed a number of issues related to this research with me; I would like to thank him for his thorough comments and suggestions.

I also would like to acknowledge the support from my family, especially my parents, Nguyen Van Hoan and Nguyen Thi Quy, who have worked hard and always believed strongly in their children, and my husband, Nguyen Hai Trieu, for sharing happiness and difficulty in life with me. To my beloved daughter Nguyen Thi Hai Phuong, who was born before this thesis was completed, I would like to express my thanks for being such a good girl, always cheering me up.
Contents

0.1 Motivations 2
0.2 Research Perspectives 3
0.3 Contributions of the Thesis 3
0.4 Organization of the Thesis 4
1 Backgrounds 6
1.1 Genetic Programming 6
1.1.1 Representation, Initialization and Operators in GP 7
1.1.2 Some Variants of GP 17
1.2 An Example of Problem Domain 19
1.3 Machine Learning 19
1.4 GP and Machine Learning Issues 21
1.4.1 Search Space 21
1.4.2 Bloat 22
1.4.3 Generalization and Complexity 22
1.5 Generalization in GP 23
1.5.1 Overview of Studies on GP Generalization Ability 24
1.5.2 Methods for Improving GP Generalization Ability 25
1.5.3 Problem of Improving Training Efficiency 33
1.6 Summary 35
2 Early Stopping, Progressive Sampling, and Layered Learning 36
2.1 Over-training, Early Stopping and Regularization 36
2.1.1 Over-training 37
2.1.2 Early Stopping 37
2.1.3 Regularization Learning 39
2.2 Progressive Sampling 41
2.2.1 Types of Progressive Sampling Schedule 44
2.2.2 Detection of Convergence 47
2.3 Layered Learning 49
2.4 Conclusion 50
3 An Empirical Study of Early Stopping for GP 51
3.1 Method 53
3.1.1 Proposed Classes of Stopping Criteria 53
3.2 Experimental Settings 57
3.3 Results and Discussions 59
3.3.1 Comparison with GP 59
3.3.2 Comparison with Tarpeian 76
3.4 Conclusion 77
4 A Progressive Sampling Framework for GP - An Empirical Approach 79
4.1 The Proposed Method 82
4.1.1 The Sampling Schedule 84
4.1.2 The Initial Sample Set Size 84
4.1.3 The Number of Learning Layers 86
4.1.4 The Stopping Criterion for Each Layer 86
4.2 Experimental Settings 87
4.2.1 Data Preparation 87
4.2.2 GP Systems Configurations 87
4.3 Results and Discussions 89
4.3.1 Learning Effectiveness Comparison 89
4.3.2 Learning Efficiency 91
4.3.3 Solution Complexity 92
4.3.4 Distribution of Best Solutions by Layers 93
4.3.5 Synergies between System Components 94
4.4 Conclusion 98
5 Conclusions and Future Work 99
5.1 Contributions 100
5.2 Future Directions 101
List of Figures
1.1 GP's main loop [57] 7
1.2 GP expression tree representing max(x*x,x+3y) [57] 8
1.3 Creation of a full tree having maximum depth 2 (and therefore a total of seven nodes) using the Full initialisation method (t=time) [57] 10
1.4 Creation of a five node tree using the Grow initialisation method with a maximum depth of 2 (t=time). A terminal is chosen at t = 2, causing the left branch of the root to be terminated at that point even though the maximum depth has not been reached [57] 12
1.5 Example of subtree crossover 14
1.6 Example of subtree mutation [57] 16
2.1 Idealized training and validation error curves. Vertical axis: errors; horizontal axis: time [88] 38
2.2 A real validation error curve. Vertical: validation set error; horizontal: time (in training epochs) [88] 38
2.3 Learning curves and progressive sample 44
2.4 Regions of the Efficiency of Arithmetic and Geometric Schedule Sampling [89] 46
3.1 Distribution of stop time (last generation of the run) of the OF criterion and GPM 74
3.2 Distribution of stop time (last generation of the run) of the VF criterion and GPM 74
3.3 Distribution of stop time (last generation of the run) of the GL criterion and GPM 75
3.4 Distribution of stop time (last generation of the run) of the PQ criterion and GPM 75
3.5 Distribution of stop time (last generation of the run) of the UP criterion and GPM 75
4.1 PS framework for GP 84
4.2 GP Evolution of each layer 87
List of Tables
1.1 Test Functions 20
1.2 The real-world data sets 20
3.1 Parameter settings for the GP systems 58
3.2 Data Sets for Problems. For the synthetic problems, the notation [min, max] defines the range from which the data points are sampled 59
3.3 Generalization error and run time of VF, OF, True Adap stopping criterion and GP, Tar 61
3.4 Size of best individual of VF, OF and True Adap stopping criterion and GP, Tar 62
3.5 p-value of run time and GE of VF, OF, True Adap stopping criterion and Tar vs GP 63
3.6 p-value of complexity of VF, OF, True Adap stopping criterion and Tar vs GP 64
3.7 Relative Generalization Errors and Run Time of OF, VF and True Adap stopping criterion (vs GP) 65
3.8 Generalization error and run time of GL, PQ and UP stopping criterion 66
3.9 Size of the best individual of GL, PQ and UP stopping criterion 67
3.10 p-value of GL, PQ and UP stopping criterion vs GP 68
3.11 p-value of size of best individual: comparison of GL, PQ and UP stopping criterion with GP 69
3.12 Relative Generalization Errors and Run Time of GL, PQ and UP stopping criterion (vs GP) 70
3.13 p-value of GE, run time and size of best solution of True Adap, PQ vs Tarpeian 77
4.1 Data Sets for the Test Functions 88
4.2 Evolutionary Parameter Settings for the Genetic Programming Systems 89
4.3 Mean and SD of Generalization Error, GPLL and GP (14 Problems) 90
4.4 Relative Average Generalization Errors and p-values (GPLL vs GP) (14 Problems) 90
4.5 Mean Run Time, GPLL and GP (14 Problems) 91
4.6 Relative Run Times and p-values (GPLLs vs GP) (14 Problems) 92
4.7 Mean Solution Size and p-values, GPLL vs GP (14 Problems) 93
4.8 Number of End-of-run Best Solutions First Found by GPLL in Each Layer 94
4.9 Average Generalization Error on 14 Problems 95
4.10 Average Run Time of All Systems 96
4.11 Average Size of Solutions Found by all Systems 97
Introduction

The Genetic Programming (GP) paradigm, first proposed by Koza [53], can be viewed as the discovery of computer programs which produce desired outputs for particular inputs. A computer program could be an expression, a formula, a plan, a control strategy, a decision tree or a model, depending on the type of problem. GP is a simple and powerful technique which has been applied to a wide range of problems in combinatorial optimization, automatic programming and model induction. The direct encoding of solutions as variable-length computer programs allows GP to provide solutions that can be evaluated and also examined to understand their internal workings. In the field of evolutionary and natural computation, GP plays a special role. In Koza's seminal book, the ambition of GP was to evolve a population of programs that can learn automatically from the training data. Since its birth, GP has developed quickly, with various fruitful applications. Some of the developments concern novel mechanisms in the GP evolutionary process, while others concern finding potential areas to which GP could be successfully applied. Despite these advances in GP research, when GP is applied to learning tasks, the issue of the generalization of GP, as a learning machine, has not received the attention that it deserves. This thesis focuses on the generalization aspect of GP and proposes mechanisms to improve GP generalization ability.
0.1 MOTIVATIONS
GP is one of the evolutionary algorithms (EAs). The common underlying idea behind all EA techniques is the same: given a population of individuals, environmental pressure causes natural selection (survival of the fittest), and this causes a rise in the fitness of the population. Given a quality function to be maximised, we can randomly create a set of candidate solutions, i.e., elements of the function's domain, and apply the quality function as an abstract fitness measure - the higher the better. Based on this fitness, some of the better candidates are chosen to seed the next generation by applying recombination and/or mutation to them. Recombination is an operator applied to two or more selected candidates (the so-called parents) and results in one or more new candidates (the children). Mutation is applied to one candidate and results in one new candidate. Executing recombination and mutation leads to a set of new candidates (the offspring) that compete - based on their fitness (and possibly age) - with the old ones for a place in the next generation. This process can be iterated until a candidate with sufficient quality (a solution) is found or a previously set computational limit is reached. GP has been proposed as a machine learning (ML) method [53], and solutions to various learning problems are sought by means of an evolutionary process of program discovery. Since GP often tackles ML issues, generalization, which is synonymous with robustness, is the desirable property at the end of the GP evolutionary process. The main goal when using GP or other ML techniques is not only to create a program that exactly covers the training examples (as in many GP applications so far), but also to obtain good generalization ability: for unseen and future samples of data, the program should output values that closely resemble those of the underlying function. Although the issue of generalization in GP has recently received more attention, the use of more traditional machine learning techniques for this purpose has been rather limited, even though generalization has long been a studied topic in machine learning.
The current theoretical models of GP are limited in use and applicability due to their complexity. Therefore, the majority of theoretical work has been derived from experimentation. The approach taken in this thesis is also based on carefully designed experiments and the analysis of the experimental results.
0.3 CONTRIBUTIONS OF THE THESIS

This thesis makes the following contributions:
1. An early stopping approach for GP that is based on an estimate of the generalization error, together with some new criteria proposed for early stopping in GP. The GP training process is stopped early at the point in time which promises to yield the solution with the smallest generalization error.
2. A progressive sampling framework for GP, which divides the GP learning process into several layers, with the training set size starting small and gradually getting bigger at each layer. The termination of training on each layer is based on certain early stopping criteria.

0.4 ORGANIZATION OF THE THESIS
The organization of the rest of the thesis is as follows:
Chapter 1 first introduces the basic components of GP - the algorithm, representation, and operators - as well as some benchmark problem domains (some of which are used for the experiments in this thesis). Two important research issues and a metaphor of GP search are then discussed. Next, the major issues of GP are discussed, especially when GP is considered as a machine learning system. The chapter overviews the approaches proposed in the literature that help to improve the generalization ability of GP. Then, solution complexity (code bloat), in the particular context of GP, and its relations to GP learning performance (generalization, training time, solution comprehensibility) are discussed.
Chapter 2 provides background on a number of concepts and techniques subsequently used in the thesis for improving GP learning performance. They include progressive sampling, layered learning, and early stopping in machine learning.

Chapter 3 presents one of the main contributions of the thesis. It proposes some criteria used to determine when to stop the training of GP, with the aim of avoiding over-fitting. Experiments are conducted to assess the effectiveness of the proposed early stopping criteria. The results emphasise that GP with early stopping can find solutions with at least similar generalization errors to standard GP, but in much shorter training time and with lower solution complexity.
Chapter 4 develops a learning framework for GP that is based on layered learning and progressive sampling. The results show that the solutions of GP with progressive sampling are significantly reduced in size and the training time is much shorter when compared to standard GP, while the generalization error of the obtained solutions is often smaller.
Chapter 5 concludes the thesis, summarizing the main results and proposing suggestions for future work that extends the research in this thesis.
Chapter 1
Backgrounds
GP [53] is an evolutionary algorithm (EA) that uses either a procedural or a functional representation. This chapter describes the representation and the specific algorithm components used in the canonical version of GP. The basics of GP are initially presented, followed by a discussion of the algorithm and a description of some variants of GP. Then, a survey of ML-related issues of GP is given. The chapter ends with a comprehensive overview of the literature on the approaches used to improve GP generalization ability.
1.1 Genetic Programming

GP is an evolutionary paradigm that is inspired by biological evolution to find solutions, in the form of computer programs, to a problem. An early form of GP used a tree representation of programs in conjunction with subtree crossover to evolve a multiplication function. The evolution through each generation is described in Figure 1.1. GP adopts a similar search strategy to a genetic algorithm (GA), but uses a program representation and special operators. It is this representation that makes GP unique. Algorithm 1 shows the basic steps to run GP.
Fig 1.1: GP’s main loop [57]
Algorithm 1: Abstract GP Algorithm [57].
Randomly create an initial population of programs from the available primitives (see Section 1.1.1.2).
repeat
    Execute each program and ascertain its fitness.
    Select one or two program(s) from the population with a probability based on fitness to participate in genetic operations (see Section 1.1.1.3).
    Create new individual program(s) by applying genetic operations with specified probabilities (see Section 1.1.1.4).
until an acceptable solution is found or some other stopping condition is met (e.g., reaching a maximum number of generations).
return the best-so-far individual.
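For illustration only, the shape of this loop can be sketched in C++ as follows. The sketch is purely structural: the individual representation and the genetic operators are stubbed out, and the names (Individual, pop_size, the random "fitness" values) are placeholders for this example, not parts of the GP system described in this thesis.

#include <algorithm>
#include <random>
#include <vector>

// Structural sketch of Algorithm 1: evaluate -> select -> vary -> replace.
struct Individual {
    // A real GP individual would hold an expression tree (see Section 1.1.1.1).
    double fitness = 1e9;   // error to be minimised
};

int main() {
    std::mt19937 rng(42);
    const int pop_size = 100, max_generations = 50;
    const double acceptable_error = 1e-3;
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, pop_size - 1);

    // Randomly create an initial population of programs.
    std::vector<Individual> pop(pop_size);

    for (int gen = 0; gen < max_generations; ++gen) {
        // Execute each program and ascertain its fitness (stub: random error values).
        for (auto& ind : pop) ind.fitness = unit(rng);

        // Stop if an acceptable solution has been found.
        double best = std::min_element(pop.begin(), pop.end(),
            [](const Individual& a, const Individual& b) { return a.fitness < b.fitness; })->fitness;
        if (best <= acceptable_error) break;

        // Select parents with probability based on fitness (size-2 tournaments here)
        // and create new individuals by applying genetic operations (stubbed out).
        std::vector<Individual> next;
        while (static_cast<int>(next.size()) < pop_size) {
            const Individual& a = pop[pick(rng)];
            const Individual& b = pop[pick(rng)];
            Individual child = (a.fitness < b.fitness) ? a : b;  // tournament winner
            // crossover and/or mutation of the winner's tree would be applied here
            next.push_back(child);
        }
        pop = std::move(next);
    }
    return 0;  // the best-so-far individual would be reported here
}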
1.1.1 Representation, Initialization and Operators in GP
This section will detail the basic steps and terminology in the process of running GP. In particular, it will define how the solution is represented in most GP systems (subsection 1.1.1.1), how to initialize the first population (subsection 1.1.1.2), and how selection (subsection 1.1.1.3) as well as recombination and mutation (subsection 1.1.1.4) are used to create the new individuals in standard GP (as proposed in [53]).
1.1.1.1 Representation
Traditionally, GP programs are expressed as expression trees. Figure 1.2 shows, for example, the tree representation of the program max(x*x, x+3*y). The variables and constants in the program (x, y, and 3) are called terminals in GP; they are the leaves of the tree. The arithmetic operations (+, *, and max) are internal nodes (called functions in the GP literature). The sets of allowed functions and terminals together form the primitive symbol set of a GP system.
Fig 1.2: GP expression tree representing max(x*x,x+3y) [57]
The terminal set. The terminal set consists of all inputs to GP programs (individuals), constants supplied to GP programs, and the zero-argument functions with side-effects executed by GP programs. Inputs, constants and other zero-argument nodes are called terminals or leaves, as they appear at the frontier of expression-tree-based programs. In this view, GP is no different from any other ML system: users have to make decisions about the set of features (inputs), and each of these inputs becomes part of the GP training and test sets as a GP terminal.
The function set. The function set comprises the statements, operators, and functions available to GP systems. These may include:

• Mathematical Functions
For example: trigonometric and logarithmic functions.
• Variable Assignment Functions
• Indexed Memory Functions
• Conditional Statements
For example: IF, THEN, ELSE; CASE or SWITCH statements
• Control Transfer Statements
For example: GO TO, CALL, JUMP
Languages that provide automatic garbage collection and dynamic lists (which ease the implementation of expression trees and of the necessary GP operations), together with scientific programming tools (e.g., MATLAB and Mathematica), have been widely used to implement GP systems. In this thesis, GP is implemented in C++.
1.1.1.2 Initializing the Population
In GP, the individuals in the initial population are usually randomly generated. There are many possible methods for tree initialisation; two common ones are the Full and Grow methods. In the Full generation method, only functions are selected as arguments to other functions until a certain depth is reached, whereupon only terminals are selected. This method ensures that all branches of the tree will be of the same depth. The Grow method allows both functions and terminals to be selected, with program branches allowed to be any depth up to some given maximum depth; when this depth is reached, only terminals are selected. This method allows trees with branches of different depths to be generated, but ensures that the trees will never become larger than a certain depth. Pseudo code for a recursive implementation of both the Full and Grow methods is given in Algorithm 2.
Fig 1.3: Creation of a full tree having maximum depth 2 (and therefore a total of seven
nodes) using the Full initialisation method (t=time) [57]
Algorithm 2: Pseudo code for recursive generation of trees with the "Full" and "Grow" methods [57].

Procedure: gen_rnd_expr(func_set, term_set, max_d, method)
if max_d = 0 or (method = grow and rand() < |term_set| / (|term_set| + |func_set|)) then
    expr = choose_random_element(term_set)
else
    func = choose_random_element(func_set)
    for i ← 1 to arity(func) do
        arg_i = gen_rnd_expr(func_set, term_set, max_d − 1, method)
    expr = (func, arg_1, arg_2, ...)
return expr

Notes: func_set is the function set, term_set is the terminal set, max_d is the maximum allowed depth for expression trees, method is either "Full" or "Grow", and expr is the generated expression in prefix notation.
The Full and Grow tree generation methods can be combined in the commonly used Ramped half-and-half method. Here, half the initial population is constructed using the Full method and the other half is constructed using the Grow method. This is done using a range of depth limits (hence the term "ramped") to ensure that the generated expression trees have a variety of sizes and shapes. While these methods are easy to implement and use, they often make it difficult to control the statistical distributions of important properties such as the sizes and shapes of the generated trees. Other initialisation mechanisms have been developed (e.g., [62]) to allow better control of these properties in instances where such control is important.
Fig 1.4: Creation of a five node tree using the Grow initialisation method with a maximum
depth of 2 (t=time) A terminal is chosen at t = 2, causing the left branch of the root to
be terminated at that point even though the maximum depth has not been reached [57]
On the other hand, the initial population does not need to be completely random. If something is known about the likely properties of the desired solution(s), expression trees that have these characteristics can be used as seeds to generate the initial population. These seeding individuals (programs) can be created by humans based on knowledge of the problem domain, or they can be the results of previous GP runs. However, unwanted situations may occur when the second generation comprises mostly the offspring of a single, or a very small number of, seeding individuals. Diversity protection techniques, such as multi-objective GP (e.g., [61]), demes [3], fitness sharing [24] and the use of multiple seeding trees, might be used. In any case, the diversity of the population should be monitored to ensure that there is a diverse set of trees in the initial population.
1.1.1.3 Fitness and Selection
Like in most other EAs, genetic operators in GP are applied to individuals that are probabilistically selected based on their fitness. Fitness is the measure used by GP to indicate how well a program has learned to predict the output(s) from the input(s) [6]. The goal of having a fitness evaluation is to give feedback to the learning algorithm; for this purpose, the fitness function should be designed to give graded and continuous feedback about how well a GP program (individual) performs on the training set. A common error-based fitness is computed from the differences between the program outputs and the desired outputs, where o_i is the desired output for the i-th example in the training set, p_i is the output of GP program p on the i-th example, and n is the number of training examples.
Squared Error Fitness Function. An alternative fitness function is the sum of the squared differences between p_i and o_i, called the squared error.
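Using the symbols defined above, these two error-based fitness measures can be written in their standard forms as follows (the absolute-error form is the one commonly used as the default fitness; the notation of the original equations is assumed here):

f_abs(p) = Σ_{i=1}^{n} | p_i − o_i |        (sum of absolute errors)

f_sq(p) = Σ_{i=1}^{n} ( p_i − o_i )²        (sum of squared errors)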
1.1.1.4 Recombination
Where GP departs significantly from other EAs is in the implementation of its operators - crossover and mutation. The most commonly used form of crossover in GP is subtree crossover. It is conducted in the following steps:

• Choose two individuals as parents, based on the reproduction selection strategy.
• Select a random subtree in each parent. Subtrees that consist of a single terminal are selected with a lower probability than other subtrees.

• Swap the selected subtrees between the two parents. The resulting individuals are the children.
The crossover operation is illustrated in Figure 1.5.
Fig 1.5: Example of subtree crossover.
The majority of the nodes of a GP tree will be leaves. Consequently, the uniform selection of crossover points leads to crossover operations that frequently exchange only very small amounts of genetic material (i.e., small subtrees); many crossovers may in fact reduce to simply swapping two leaves. Therefore, crossover points are usually not selected with uniform probability: Koza suggested the widely used approach of choosing functions 90% of the time, while leaves are selected only 10% of the time. While subtree crossover is commonly used in GP, a number of variants of this operator have been defined in the literature. For example, one-point crossover [86, 4] works by selecting a common crossover point in the parent programs and then swapping the corresponding subtrees. In context-preserving crossover [21], the crossover points are constrained to have the same coordinates, as in one-point crossover. Size-fair crossover and homologous crossover were described in [56]. In size-fair crossover, the first crossover point is selected randomly, as in standard subtree crossover. Then the size of the subtree to be excised from the first parent is calculated. This is used to constrain the choice of the second crossover point so as to guarantee that the subtree excised from the second parent will not be "unfairly" big. The homologous crossover operator works identically to the size-fair crossover operator up to the last step: instead of randomly choosing between all the available subtrees of the desired size in the second parent, it deterministically chooses the one closest to the subtree in the first parent.
1.1.1.5 Mutation
The mutation operator is used to make a random change to an existing program tree. In the standard subtree mutation used in this thesis, an individual is selected for mutation using fitness-proportional selection. A single function or terminal is selected at random, from among the set of functions and terminals making up the original individual, as the point of mutation. The mutation point, along with the subtree stemming from it, is then removed from the tree and replaced with a new, randomly generated subtree. The new subtree inserted into the tree to form the new individual is created using the "grow" method of tree construction, as used to generate an entire tree as described above. Figure 1.6 depicts the process. Subtree mutation is sometimes implemented as crossover between a program and a newly generated random program; this operation is also known as "headless chicken" crossover [1]. Another alternative form of mutation is point mutation, which is the rough GP equivalent of the bit-flip mutation used in GAs. In point mutation, a random node is selected and the primitive stored there is replaced with a different random primitive of the same arity, taken from the primitive set (functions and terminals). If no other primitive with that arity exists, nothing happens to that node (but other nodes may still be mutated). Note that when subtree mutation is applied, it involves the modification of exactly one subtree. Point mutation, on the other hand, is typically applied with a given mutation rate on a per-node basis, allowing multiple nodes to be mutated independently.
Fig 1.6: Example of subtree mutation [57]
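The per-node point mutation just described can be sketched as follows; the tree type and the primitive sets grouped by arity are again illustrative assumptions, not the operators of the thesis system.

#include <memory>
#include <random>
#include <string>
#include <vector>

struct Node {
    std::string symbol;
    std::vector<std::unique_ptr<Node>> children;      // size == arity of the primitive
};

// Illustrative primitives grouped by arity; a real system derives these from its primitive set.
const std::vector<std::string> kArity0 = {"x", "y", "1"};
const std::vector<std::string> kArity2 = {"+", "-", "*", "max"};

// Point mutation: visit every node and, with probability rate, replace its primitive with a
// different primitive of the same arity; if no other primitive of that arity exists, do nothing.
void point_mutation(Node& node, double rate, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < rate) {
        const auto& all = node.children.empty() ? kArity0 : kArity2;
        std::vector<std::string> pool;
        for (const auto& s : all)
            if (s != node.symbol) pool.push_back(s);   // candidates of the same arity
        if (!pool.empty()) {
            std::uniform_int_distribution<size_t> pick(0, pool.size() - 1);
            node.symbol = pool[pick(rng)];             // keep the structure, change only the label
        }
    }
    for (auto& child : node.children) point_mutation(*child, rate, rng);
}

int main() {
    std::mt19937 rng(3);
    auto leaf = [](const char* s) { auto n = std::make_unique<Node>(); n->symbol = s; return n; };
    Node root;
    root.symbol = "+";
    root.children.push_back(leaf("x"));
    root.children.push_back(leaf("y"));
    point_mutation(root, 0.1, rng);                    // each node mutated independently
    return 0;
}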
There are a number of mutation operators which treat constants in GP programs as special cases. For example, [95] mutates constants by adding Gaussian noise to them. Others use a variety of potentially expensive optimization tools to try to fine-tune an existing program by finding the "best" values for the constants within it. E.g., [96] uses "a numerical partial gradient ascent to reach the nearest local optimum" to modify all constants in a program, while [98] uses simulated annealing to stochastically update numerical values within individuals.
While mutation is not necessary for GP to solve many problems, [79] argues that, in some cases, GP with mutation alone can perform as well as GP using crossover. While mutation was rarely used in early GP work, it is more widely used in GP today, especially in modeling applications.
1.1.1.6 Reproduction
The reproduction operator copies an individual from one population into the next.
1.1.1.7 Stopping Criterion
Traditionally, GP uses a generational approach: each generation is represented by a complete population of individuals, and the newer population is created from, and then replaces, the older population. Alternatively, a steady-state approach can be used, in which there is a continuous flow of individuals being selected, mated, and producing offspring; the offspring replace existing individuals in the same population. A maximum number of generations usually defines the stopping criterion in GP. However, when it is possible to achieve an ideal individual (with ideal fitness), this can also stop the GP evolutionary process. In this thesis, some other criteria are proposed, such as a measure of generalization loss, a lack of fitness improvement, or a run of generations with over-fitting.
1.1.2 Some Variants of GP

In linear GP crossover, sequences of words containing a whole number of instructions are typically swapped over. Similarly, mutation operations normally respect word boundaries and generate legal machine code.
However, linear GP lends itself to a variety of other genetic operators. If the goal is execution speed, then the evolved code should be machine code for a real computer rather than some higher-level language or virtual-machine code. Peter Nordin started this trend by evolving machine code for SUN computers [74]. Ron Crepeau [14] targeted the Z80. Kwong Sak Leung's linear GP [58] was firmly targeted at novel hardware, but much of the GP development had to be run in simulation whilst the hardware itself was under development.
In its simplest form, Parallel Distributed GP (PDGP) is a generalization of standard GP. However, more complex representations can be used, which allow the exploration of a large space of possible programs, including standard tree-like programs, logic networks, neural networks, recurrent transition networks and finite state automata. In PDGP, programs are manipulated by special crossover and mutation operators which guarantee the syntactic correctness of the offspring; for this reason, PDGP search is very efficient. PDGP programs can be executed in different ways, depending on whether nodes with side effects are used or not. In a system called PADO (Parallel Algorithm Discovery and Orchestration), Teller and Veloso [107] used a combination of GP and linear discrimination to obtain parallel classification programs for signals and images. The programs in PADO are represented as graphs, although their semantics and execution strategy are very different from those of PDGP.
1.1.2.3 Grammar Based GP
In Grammar Based GP, each program is a derivation tree generated by a grammar; the grammar, often context-free, defines a language whose terms are expressions corresponding to those used in standard tree-based GP. Grammars bring a number of benefits to GP, one of the most important being that they help to restrict the search space with prior knowledge from users. This restriction is done via the crafting of grammar production rules. Grammar Based GP has found a large and varied range of real-world applications, for example in ecological modeling, medicine, finance, and design. For more details on the history and developments of this subfield, readers are referred to [68] and the references therein.
1.2 An Example of Problem Domain

The domain of symbolic regression is one of the most common problem domains in GP. In symbolic regression, input and output pairs are used to infer a functional model in symbolic form: GP is asked to find a program that approximates a target (unknown) function. A target function f(x) = y is applied to domain values, x, in a pre-determined range. The resulting values are then compared with the candidate program's values on the same domain values. This thesis uses the 10 polynomials in Table 1.1, with white (Gaussian) noise of standard deviation 0.01; that is, each target function is F(x) = f(x) + ε. Four real-world data sets from the UCI machine learning repository [7] and StatLib [116] are also used, as shown in Table 1.2.
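For illustration, a noisy training sample for one synthetic target of this kind could be generated as in the following sketch; the particular polynomial, sampling range and sample size are placeholders rather than the exact settings of Table 1.1.

#include <random>
#include <utility>
#include <vector>

// Draw n training pairs (x, F(x)) with F(x) = f(x) + eps, where eps ~ N(0, 0.01^2).
std::vector<std::pair<double, double>> make_training_set(int n, double lo, double hi, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> sample_x(lo, hi);
    std::normal_distribution<double> noise(0.0, 0.01);
    auto f = [](double x) { return x * x * x + x * x + x; };   // placeholder polynomial target
    std::vector<std::pair<double, double>> data;
    for (int i = 0; i < n; ++i) {
        double x = sample_x(rng);
        data.emplace_back(x, f(x) + noise(rng));
    }
    return data;
}

int main() {
    auto train = make_training_set(50, -1.0, 1.0, 11);
    return train.size() == 50 ? 0 : 1;
}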
Tab 1.1: Test Functions

Tab 1.2: The real-world data sets
Data sets | Features | Size | Source
Concrete Compressive Strength | 9 | 1030 | UCI

1.3 Machine Learning

Machine learning is programming computers to optimize a performance criterion using example data or past experience [69]. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. First, in training, we need efficient algorithms to solve the optimization problem, as well as to store and process the massive amount of data we generally have. Second, once a model is learned, its representation and algorithmic solution for inference need to be efficient as well. In certain applications, the efficiency of the learning or inference algorithm, namely its space and time complexity, may be as important as its predictive accuracy. Therefore, there are several possible criteria for evaluating a learning algorithm's performance: predictive accuracy of the classifier, speed of the learner, speed of the classifier, and space requirements. Among these, predictive accuracy is the most common criterion. Generalization is an important issue in ML. The generalization ability of a learning machine depends on a balance between the information in the training examples and the complexity of the model. Bad generalization occurs if:
• Over-fitting of data: the information does not match the complexity - the training
data set is too small, or the complexity of the models is too large
• Under-fitting of data: the opposite situation.
1.4 GP and Machine Learning Issues

GP is one of only two ML techniques explicitly able to represent and learn relational (or first-order) knowledge, the other being Inductive Logic Programming (ILP) and its extended variants. GP has been successfully applied to a wide range of ML problems. In spite of its success, GP remains a topic of open research and development, with many aspects still theoretically unexplained and many big questions awaiting future answers. There are complex feasibility constraints (in most systems, not all genotypes are legal) and complex fitness dependencies (epistases). This section explores some open questions in GP research from the ML perspective, which motivate the research in this thesis.
1.4.1 Search Space
GP is a search technique that explores the space of computer programs. The search for solutions to a problem starts from a group of points (random programs) in this search space. Those points that are of above-average quality are then used to generate a new generation of points through crossover, mutation, reproduction and possibly other genetic operations. This process is repeated over and over again until a stopping criterion is satisfied. We shall show how the space of all possible programs, i.e. the space which GP searches, scales - particularly how it changes with respect to the size of programs. We will show, in general, that above some problem-dependent threshold, considering all programs, their fitness shows little variation with their size. The distribution of fitness levels, particularly the distribution of solutions, gives us directly the performance of random search. In implementations of GP, the maximum depth of a tree is not the only parameter that limits the search space; an alternative could be the maximum number of nodes of an individual, or both (depth and size).
1.4.2 Bloat
In the course of evolution, the average size of the individuals in the GP population often increases greatly; at some point it may even start growing at a rapid pace. Typically, the increase in program size is not accompanied by any corresponding increase in fitness. The origin of this phenomenon, which is known as bloat [5, 63], has effectively been a subject of research for over a decade. In fact, an increase in the size of individuals over the generations may be perfectly reasonable if it is needed to comply with all the fitness cases, given the random initialization of the GP population; so bloat is not the same as growth. We should only talk of bloat when there is growth without (significant) return in terms of fitness. Because of its practical effects (large programs are hard to interpret, may have poor generalization, and are computationally expensive to evolve and later use), bloat has been a subject of intense study in GP [22, 87, 71, 99, 106, 67]. As a result, many theories have been proposed to explain bloat: the replication accuracy theory, the removal bias theory, the nature of program search spaces theory, etc. There is a great deal of confusion in the field as to the reasons for (and the remedies for) bloat.
1.4.3 Generalization and Complexity
Generalization is just as important for genetic-based methods as it is for conventional learning methods. The power of learning lies in the fact that learning systems can extract implicit relationships that may exist in the input-output mapping or in the interaction of the learner with the environment. Learning such an implicit relationship only during the training phase may not be sufficient: a generalization of the learned tasks or behaviors may be essential for several reasons. The earlier work in GP focused on solving problems on the training data set, without considering that over-fitting on the training set will make the solution worse at fitting future data. Therefore, the generalization ability of GP has been paid more attention in recent works [13, 33, 111]. Most of the research on improving the generalization ability of GP has concentrated on avoiding over-fitting on the training data. One of the main causes of over-fitting has been identified as the "complexity" of the hypothesis generated by the learning algorithm. When GP trees grow to fit, or to specialize on, "difficult" learning cases, they will no longer have the ability to generalize further; this is entirely consistent with the principle of Occam's razor [52], which states that simple solutions should always be preferred. The relationship between generalization and individual complexity in GP has often been studied in the context of code bloat, which is an extraordinary enlargement of the complexity of solutions without an increase in their fitness.
For the aforementioned reasons, the following section will synthesize the techniques used to improve the generalization ability of GP, before going on to introduce promising techniques that can increase this ability.
1.5 Generalization in GP

GP has been proposed as an ML method, and the core objective of a learner is to generalize from its experience. The training examples come from some generally unknown probability distribution, and the learner has to extract from them something more general, something about that distribution, that allows it to produce useful answers.

1.5.1 Overview of Studies on GP Generalization Ability

In [54], generalization is defined as follows.

Definition 1. Given a learner and its performance after it is "trained" on a subset of the instance space X, if it exhibits a satisfactory performance on another set of instances belonging to X that does not contain instances used in the training phase, the learner is said to have generalized over this new set of instances.
Although achieving high generalization ability is the main objective of any learning machine, it was not seriously considered in the field of GP for a long time. Before Kushchu published his seminal paper on the generalization ability of GP [54], there were rather few works in the literature dealing with the GP generalization aspect. Common to most of the research above are the problems with obtaining generalization in GP and the attempts to overcome these problems. These approaches can be categorized as follows:

• Using separate training and testing sets in order to promote generalization in supervised learning problems.

• Changing the training instances from one generation to another, evaluating performance based on subsets of the training instances, or combining GP with ensemble learning techniques.
• Changing the formal implementation of GP: the GP representation, the genetic operators and the selection strategy.
1.5.2 Methods for Improving GP Generalization Ability
This section briefly overviews some main approaches in the literature for improving the generalization ability of GP. They are organized into the categories mentioned in the previous subsection.
1.5.2.1 Sampling Based Approaches
Sampling-based approaches are data-driven methods for improving GP learning.
In [35], the author proposed a simple sampling strategy for the training of GP. It forces GP's attention onto the hard fitness cases (samples), i.e. the ones which are frequently mis-classified, and also involves fitness cases which have not been worked on for several generations. In each generation, a subset of training samples is selected by the following two passes through the full training set. In the first pass over the entire training set, of size T, at generation g, each training case i is assigned a weight W_i, which is the sum of its current "learning hardness" D_i, exponentiated to a certain power d, and the number of generations since it was last selected (its age) A_i, also exponentiated to a certain power. In the second pass, each case i is selected for the subset with a probability determined by its weight and the target subset size S:

∀i : 1 ≤ i ≤ T,    P_i(g) = W_i(g) · S / Σ_{j=1}^{T} W_j(g)

This scaled probability ensures that the average selected subset size equals the target subset size S. If a case i is selected to be in the subset, then its hardness D_i and age A_i are set to 0; otherwise its hardness remains unchanged and its age A_i is incremented. While testing each member of the GP population against each fitness case in the current training subset, the hardness D_i (starting from 0) is incremented each time the fitness case is mis-classified by one of the GP trees. The summary results showed that GP with this scheme of dynamic subset selection (DSS) produced results as good as those of standard GP, in a much shorter time.
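The weighting and scaled selection probability of DSS can be sketched as follows; the exponents d and a, the target subset size S, and the function names are illustrative assumptions made for this example, not values or code from [35] or from this thesis.

#include <cmath>
#include <random>
#include <vector>

// DSS weights: hardness^d + age^a for each training case.
std::vector<double> dss_weights(const std::vector<double>& hardness, const std::vector<double>& age,
                                double d, double a) {
    std::vector<double> w(hardness.size());
    for (size_t i = 0; i < w.size(); ++i) w[i] = std::pow(hardness[i], d) + std::pow(age[i], a);
    return w;
}

// Select each case i independently with probability P_i = W_i * S / sum_j W_j,
// so that the expected subset size equals the target size S.
std::vector<int> dss_select(const std::vector<double>& w, double S, std::mt19937& rng) {
    double total = 0.0;
    for (double wi : w) total += wi;
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<int> chosen;
    for (size_t i = 0; i < w.size(); ++i)
        if (coin(rng) < w[i] * S / total) chosen.push_back(static_cast<int>(i));
    return chosen;
}

int main() {
    std::mt19937 rng(5);
    std::vector<double> hardness = {0, 3, 1, 0, 5};   // mis-classification counts so far
    std::vector<double> age      = {2, 0, 1, 4, 0};   // generations since last selected
    auto w = dss_weights(hardness, age, /*d=*/1.0, /*a=*/1.0);
    auto subset = dss_select(w, /*S=*/3.0, rng);
    return subset.empty() ? 1 : 0;
}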
A good example of experimental attempts to increase the robustness of the programs evolved by GP is presented in the work of Moore and Garcia [70]. The system proposed in the paper evolves optimized manoeuvres for the pursuer/evader problem. The results of this study suggest that the use of a fixed set of training cases may result in brittle solutions, due to the fact that such fixed fitness cases may not be representative of the possible situations that the agent may encounter. It was shown that the use of randomly generated fitness cases at the end of each generation can reduce the brittleness of the solutions when they are tested against a large set of representative situations after evolution.
Byoung-Tak Zhang [122] introduced incremental data inheritance (IDI for short), which evolves programs using program-specific subsets of the given data that also evolve incrementally as the generations go on. The basic idea of the IDI method in GP is that programs and their data are evolved at the same time: each program is learned on a separate data set. Thus, IDI maintains two populations: a program population and a data population. The program population A(g) consists of individuals A_i(g) and the data population D(g) consists of data individuals D_i(g), where g is the generation. This method presented a Bayesian framework