A C K N O W L E D G E M E N T S
I would like to express my appreciation to a number of people who have contributed, directly or indirectly, to this thesis. First and foremost, I would like to express gratitude and appreciation to Assistant Professor Lakshminarayanan Samavedham for his guidance, encouragement and support during this study. His optimism, nourishing words of encouragement and understanding of one's limits and potential were the keys to the success of this research.
I am very much indebted to the National University of Singapore for providing a Research Scholarship that made my studies possible at the Department of Chemical and Environmental Engineering. Immeasurable thanks go to my "Data Analysis and Control System Group" members for providing a congenial environment. I would like to thank our group members, particularly Madhukar, Prabhat, Dharmesh, Mranal and May Su Tun, for proofreading this thesis and for the pleasurable time throughout the past one and a half years. I am very grateful to Mr P. Vijaysai of the Department of Chemical Engineering, IIT (Bombay), India, who suggested the use of differential evolution optimization.
Finally, I would like to dedicate this work to my parents, brothers and sisters, who brought me to this level; special thanks are due to them. A special "thank you" goes to my beloved Chaw Su Thwin for her understanding, encouragement and support.
T A B L E  O F  C O N T E N T S
Acknowledgements i
Summary vi
Nomenclature viii
CHAPTER 1: INTRODUCTION
CHAPTER 2: THE BASICS OF GENETIC ALGORITHMS AND GENETIC PROGRAMMING
2.5 Shortcomings and limitations of genetic programming 19
4.5.2.1 Gauss-Newton optimization 49
4.5.2.2 Gauss-Newton optimization algorithm 51
5.3 Identification of implicit algebraic equation systems 77
5.4 Identification of discrete dynamic algebraic systems 81
Case Study 4: Simulated nonlinear dynamical system 81
Case Study 5: Experimental heat exchanger system 83
Case Study 6: Modeling of an acid-base neutralization system 85
Case Study 7: Modeling of rainfall-runoff data 92
5.5 Identification of ordinary differential equation systems 95
Case study 9: Reversible first order series reaction 102
5.6 Determination of number of states 108
Case Study 11: Linear series reaction system 109
5.7 Integration of nonparametric regression techniques into DACS-GP 112
S U M M A R Y
The objective of the present study is to develop system identification tools using the genetic programming paradigm. The developed software is able to identify several types of models, ranging from algebraic to differential equation systems, from process data. Identification of state space models using genetic programming is a relatively new area of application that has been attempted here.
Genetic programming (GP) is a powerful tool for system identification when little is known about the underlying model structure in the data. The technique is attractive due to its ability to search discontinuous, complex nonlinear spaces. GP works on a population of individuals, each of which represents a potential solution. A population of model structures evolves through many generations towards a solution using certain evolutionary operators and a 'survival-of-the-fittest' selection scheme. GP operators include reproduction, crossover and mutation.
The developed program employs a unique approach to model representation which helps in developing faster and more robust computer programs. The representation is flexible, so that it can be applied to different application domains without changing the main GP algorithm. Extensive literature on algebraic modeling using the GP approach is available; however, there have been few or no reported applications of GP in the context of state space models. Many chemical processes can be more conveniently represented by a set of nonlinear differential-algebraic equations. There is also no standard method for identifying differential equation models from experimental data alone. We have successfully identified nonlinear differential equation models for several batch reaction data sets. We introduce a new concept to take advantage of process knowledge
through a user-defined 'evolution policy'. A new fitness measure that takes into account the functional complexity of the model is also proposed. We also propose several enhancements to improve the efficiency of GP, such as modified genetic operators, a new block model representation using the Simulink process simulator, distributed computing, integration of nonparametric techniques and implicit algebraic equation modeling. The results of these are shown and promising improvements are recommended.
The developed program was applied to, and its performance examined on, a wide range of system identification tasks. System identification studies are carried out on systems modeled by algebraic, dynamic algebraic and nonlinear state space equations. Its workability is illustrated by twelve case studies, which employ both simulated and experimental data sets. The developed program satisfactorily identified nonlinear dynamic models for all the systems studied.
An executable version of the program, together with example files and data, is attached at the back of the thesis. The program has an easy, intuitive and interactive graphical user interface with online and offline analysis tools, and was written in the MATLAB programming language.
N O M E N C L A T U R E
Abbreviation Explanation
ACE Alternating Conditional Expectation Analysis
B ‘B’ combined with a variable name indicates a sample lag; e.g., yB1 is the output variable lagged by one sample
DACS Data Analysis and Control System group
MARS Multivariate Adaptive Regression Splines
t time
u1, u2, … input variables
y1, y2, … output variables
L I S T  O F  T A B L E S
Table 5.7 GP configuration script file for Case Study 4 82
Table 5.11 Nominal operating conditions for the acid-base neutralization system
Table 5.12 Configuration for Case Study 6 (acid-base system) 90
Table 5.14 Configuration for rainfall runoff modeling (Case Study 7) 93
Table 5.16 Configuration details for Case Study 8 97
Table 5.18: Configuration details for Case Study 9 103
Table 5.19 Configuration details for Case Study 10 106
Table 5.20 Results of GP runs for the Lotka-Volterra system data (Case Study 10)
L I S T  O F  F I G U R E S
Figure 2.1 Flowchart of the genetic programming algorithm 11
Figure 3.2 Tree structure of a differential equation model 25
Figure 3.4 Average length of chromosome vs generation: (left) a run with the MDL fitness measure, (right) a run with the proposed fitness measure 34
Figure 4.2 DACS-GP GUI: quick configuration windows 59
Figure 4.4 DACS-GP GUI: Genetic environment configuration 60
Figure 5.4 The fit obtained with the identified model for Case Study 2 78
Figure 5.5 Model fit for the ellipse data (Case Study 3) 80
Figure 5.6 Model fit and prediction of the “best” GP model for
Figure 5.7 Schematic of the acid-base neutralization system 87
Figure 5.9 Predictions obtained with the Hammerstein model 90
Figure 5.10 Plot of model vs data (x-axis: Samples, y-axis: pH) 92
Figure 5.12 Performance of GP 98
Figure 5.13 Crossover of individuals from generation 4 to create the
Figure 5.14 Crossover of individuals from generation 9 to create the
Figure 5.15 Crossover of individuals from generation 13 to create the
Figure 5.16 Plot of best model vs data from Run 1 (Case Study 9) 104
Figure 5.17 Plot of best model vs data from Run 2 (Case Study 9) 104
Figure 5.18 Plot of best model vs data from Run 3 (Case Study 9) 105
Figure 5.19 Plot of best model vs data for the Lotka-Volterra System
Figure 5.20 Plot of best model vs data for the linear series system
Figure 5.21 Plot of best model vs data for the linear series system
Figure 5.22 Plot of best model vs data for the linear series system 112
Figure 5.24 Data vs model prediction for Case Study 12 118
provides a family of comparable solutions from which the informed user can select a model that is commensurate with the physics/chemistry of the system as well as the intended end use. The GP method is computationally intensive but can come in quite handy in situations where little or nothing is known about the structure of the process model.
The details of the GP method will be provided in the following chapter. Presently, we will outline the range of applications that GP has been used for in chemical engineering and other domains. Iba et al. (1993, 1994) use GP for system identification based on data from static and dynamical simulated systems (interestingly, in their 1993 paper, they do not use the term “genetic programming”; they refer to their method as “a variant of genetic algorithm which permits a GA to use structured representations”). Their method (STROGANOFF) is implemented in the LISP language and uses second order polynomials as the basic building block. Gray et al. (1996) discuss the application of genetic programming to nonlinear modeling. To the best of the authors’ knowledge, the first applications of GP to chemical engineering problems were made by McKay et al. (1997) and Willis et al. (1997). McKay et al. (1997) apply GP to the steady state modeling of chemical processes and consider a tutorial example followed by applications to two typical processes - a vacuum distillation column and a chemical reactor system. Willis et al. (1997) extend the GP methodology to the development of dynamic input-output models and demonstrate its workability by the development of inferential estimation models for a vacuum distillation column and a twin screw cooking extruder. Marenbach (1998) uses GP for the construction and refinement of models for a simulated bioprocess. This work employs block diagrams (such as those available in SIMULINK® under MATLAB®) for representing the models – this is in contrast to
other works that are equation oriented. Greeff and Aldrich (1998) illustrate GP-based empirical modeling by means of three examples, two of which are based on data pertaining to leaching experiments. Lakshminarayanan et al. (2000) use a composite GP-PCA approach (genetic programming combined with a multivariate statistical technique called principal components analysis (PCA)) to generate nonlinear models for industrial product design applications. Gao and Loney (2001) combine GP with a neural network to evolve a polymorphic neural network and apply it to predicting pH in a simulated CSTR. Grosman and Lewin (2002) employ GP for nonlinear model predictive control (NMPC). In their work, the nonlinear model predictive controller uses predictions provided by the GP-generated model, and this is shown to improve the control performance on two multivariable simulated processes: a mixing tank and a Karr liquid-liquid extraction column. Tang and Li (2002) demonstrate the use of GP
by using it in conjunction with partial least squares (PLS) and GA for quantitative structure-activity relationship (QSAR) studies. Davidson et al. (2003) propose a hybrid approach to construct polynomial regression models. They demonstrate their method using two simulation examples and through modeling of rainfall-runoff data in the Kirkton catchment in Scotland. While their method takes care of some of the drawbacks encountered in “standard” GP implementations (such as ephemeral random constants, code bloat (presence of non-functional code within mathematical expressions) and hidden complexity (for example, k*x (involving one parameter) written as (k1*x) + (k2*x) + … + (k5*x) (which involves five parameters))), it has the limitation of restricting the mathematical operators to addition, multiplication and non-negative integer powers. Hong and Rao (2003) use GP to model the dynamics of a municipal activated-sludge wastewater treatment plant. Ashour et al. (2003) employ GP to derive a complicated nonlinear relationship
relating the various input parameters associated with reinforced concrete deep beams to their ultimate shear strength. Swain and Morris (2003) apply GP to obtain models for single and two-link terrestrial manipulator systems. Kaboudan (2003) provides an insight into the forecasting ability of GP-generated models by employing simulation examples and real-world sunspot numbers. GP has also found use in extracting comprehensible knowledge from hepatitis and breast cancer datasets (Tan et al., 2003). While the above list is by no means exhaustive, it does point to the versatility of the GP methodology as being applicable to a wide variety of problems, and to the fact that GP has been receiving a lot of attention lately. It is important to point out that very little theory has been developed to explain genetic programming. Even so, there have been startling empirical successes, such as the development of circuit description programs and the rediscovery of several patented circuits (Koza and coworkers, 1999a,b).
The primary goal of this thesis is to develop a MATLAB-based GP modeling tool that is capable of handling data from static and dynamical systems. Interesting aspects of this tool include the automatic generation of nonlinear state space models from time series data and the incorporation of powerful nonparametric modeling tools such as Alternating Conditional Expectation (ACE) (Breiman and Friedman, 1985) and Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991; DeVeaux et al., 1993) for initial screening of variables and models. Such initial screening helps to reduce the computational burden of the GP methodology, as will be demonstrated in chapter 5.
1.2 Outline of the Thesis
This thesis is organized as follows. Chapter 2 reviews the genetic programming literature, presents the algorithm, discusses the fundamental theory and concludes in a balanced fashion by pointing to the limitations of the GP methodology. Chapter 3 discusses the formulation of GP for system identification applications. It also introduces novel evolution policies, discusses improved genetic operators and formulates a new fitness measure. Chapter 4 exclusively focuses on the MATLAB-based DACS-GP program, which was developed from scratch during the course of this research. This chapter also discusses important numerical techniques such as optimization and the computation of analytical gradients, as well as other key implementation issues (e.g., a suitable data structure for genes). The workability of the developed tools and methods is vividly illustrated in chapter 5 using several case studies that include algebraic modeling, dynamic discrete system modeling and continuous nonlinear differential equation modeling. The data sets employed in these studies include data from simulated systems as well as from real world experimental systems, and are used to assess the performance of the developed computer code. Finally, chapter 6 summarizes and discusses the results presented in this thesis, along with some suggestions for future research. A CD-ROM that includes an executable version of the program and example files is also attached to this thesis.
C h a p t e r  2
THE BASICS OF GENETIC ALGORITHMS AND GENETIC
PROGRAMMING
Listening and noting well enriches knowledge
Knowledge enhances progress
– Loka Niti
2.1 Introduction
The general task of a system identification problem is to approximate the input-output behavior of an unknown process system using an appropriate model. Appropriate modeling components must be selected to ensure a model that can accurately reproduce the behavior of the plant. If sufficient knowledge of the system is available, then one could write down the system of governing equations (algebraic / ordinary differential equations / partial differential equations) containing unknown parameters that can be estimated from experimental process data. Oftentimes, the physical and chemical phenomena governing processes are unknown. This scenario leads to the employment of system identification tools, where a model structure is assumed and the parameters of this model are estimated from the available data using suitable techniques. The parameters of this “black box” model may or may not have a direct relationship to the physical parameters of a first principles model. This difference between the “parameters” of a first principles model and those of an empirical (data-based) model must be understood. The empirical modeling exercise becomes more
challenging and intriguing if the model structure itself is unknown. This is precisely the domain where methods such as neural networks and genetic programming have established a niche for themselves. However, neural networks have the drawback of being “opaque”, while GP has the advantage of being transparent (in that it can provide explicit mathematical equations for the model) and of providing a family of models rather than a single model. The latter feature can help the modeler to choose a physically meaningful model from a population of valid models.
In GP, the initial population of models is modified through the evolution process, possibly leading to a population of “optimal” process models. Numerical parameters that appear in each of the models are estimated in order to minimize the error between the actual output and the model predictions. The evolution of a population over time in GP is based on simple concepts of natural selection and evolution: “survival of the fittest” and “reproduction”. At each “generation”, the individuals are evaluated, and the fitter ones are selected to produce offspring in an attempt to form “better” individuals for the next generation.
In this chapter, we first review the topic of genetic algorithms (GA), since GP is based on them. This review is followed by an introduction to genetic programming, wherein a common form of the GP algorithm is explained in detail. The chapter ends by pointing out the limitations and shortcomings of the genetic programming methodology.
2.2 Overview of Genetic Algorithms (GA)
John Holland's pioneering book “Adaptation in Natural and Artificial Systems” (Holland, 1975) showed how the evolutionary process may be applied effectively to solve a wide variety of problems using a highly parallel technique that is now called the genetic algorithm. The genetic algorithm imitates the Darwinian principle of reproduction and survival of the fittest, along with naturally occurring genetic operations such as crossover (recombination) and mutation, to solve a suitably posed mathematical problem. The genetic algorithm is formulated by representing solutions as individuals in a population that evolves through several generations. Each individual in the population represents a possible solution to the given problem. The genetic algorithm attempts to find a very good or best solution to the problem by genetically breeding the population of individuals.
In preparing to use the conventional genetic algorithm operating on fixed-length character strings to solve a problem, the user must:
1. determine the representation scheme,
2. determine the fitness measure,
3. determine the parameters and variables for controlling the algorithm, and
4. determine a way of designating the result and a criterion for terminating the run.
In the conventional genetic algorithm, the individuals in the population are usually fixed-length character strings patterned after the biological chromosome. The most important part of the representation scheme is the mapping that expresses each possible point in the search space of the problem as a fixed-length character string (i.e., as a chromosome) and each chromosome as a point in the search space of the problem. Selecting a representation scheme that facilitates solution of the problem by the genetic algorithm often requires considerable insight into the problem and good judgment. The evolutionary process is driven by the fitness measure, which assigns a fitness value to each individual in the population.
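The chromosome-to-search-space mapping can be illustrated with a minimal sketch (given here in Python for brevity, although the software developed in this thesis is written in MATLAB; the function name is illustrative): a fixed-length bitstring decoded to a real parameter value in a given interval.

```python
def decode(bits, lo, hi):
    # map a fixed-length bitstring chromosome to a real value in [lo, hi]
    n = int(bits, 2)                         # binary string -> integer
    return lo + (hi - lo) * n / (2 ** len(bits) - 1)
```

With a 4-bit chromosome, '0000' maps to the lower bound and '1111' to the upper bound; the fixed string length decides, in advance, how finely the interval is resolved, which is exactly the limitation Koza criticizes later in this chapter.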
The advent of genetic algorithm techniques opened up many optimization applications that were not possible before. The genetic algorithm has evolved since then, and innovative ideas are still in progress. One example of such innovation is the Nondominated Sorting Genetic Algorithm (NSGA) developed by Srinivas and Deb (1995). Yee et al. (2003) described the multiobjective optimization of an industrial styrene reactor using NSGA, and provide a broad range of quantitative results useful for understanding and optimizing industrial styrene production. The algorithm differs from the traditional genetic algorithm in the way the selection operator works. In the NSGA, prospective solutions are sorted into fronts, which are imaginary enclosures within which all chromosomes are mutually nondominating. Such fronts are ranked progressively until all the chromosomes are accounted for. Each chromosome is then assigned a fitness value obtained by sharing a dummy fitness value of the front by its niche count, a parameter proportional to the number of chromosomes in its neighborhood within the same front. This helps spread out the chromosomes while maintaining the diversity of the gene pool.
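The front-sorting step of the NSGA can be sketched as follows (a Python illustration for two minimized objectives, omitting the niche-count fitness sharing; function names are mine, not Srinivas and Deb's): repeatedly peel off the set of mutually nondominating points.

```python
def dominates(a, b):
    # a dominates b (for minimization) if a is no worse in every
    # objective and strictly better in at least one
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def sort_into_fronts(points):
    # peel off successive nondominated fronts until none remain
    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q is not p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts
```

For the objective vectors (1,4), (2,2), (4,1), (3,3) and (4,4), the first three are mutually nondominating and form front 1; (3,3) forms front 2 and (4,4) front 3.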
2.3 Genetic Programming
Genetic programming is an extension of the conventional genetic algorithm in which each individual in the population is a computer program. The most important characteristic of GP is its use of a tree structure representation scheme. Koza (1992) criticized the limitation of the genetic algorithm representation by noting that “representation schemes based on fixed-length character strings do not provide any convenient way of representing arbitrary computational procedures or of incorporating iteration or recursion when these capabilities are desirable or necessary to solve a problem. Moreover, such representation schemes do not have dynamic variability. The initial selection of string length limits in advance the number of internal states of the system and limits what the system can learn”.
Figure 2.1 is a flowchart of the genetic programming paradigm. The flowchart contains a loop executing multiple independent runs of genetic programming. The important parameters to be initialized are: number of runs, number of generations, population size, probabilities of the genetic operations and the specification of the termination criterion.
Genetic programming starts with an initial population of randomly generated computer programs composed of functions and terminals appropriate to the problem domain. The functions may be standard arithmetic operations, programming operations, mathematical functions or domain-specific functions. Each individual computer program in the population is measured in terms of how well it conforms to the objective of the problem at hand; this measure is called the fitness measure. In system identification applications, the model serves as the program and is simulated using a simulation engine. The simulation engine may be an algebraic evaluator, a differential equation solver or a block diagram based simulator like SIMULINK. The resulting simulated output data for a given input is compared with the actual output data. The parameters in the model are optimized or fine-tuned before the fitness measure is computed.
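The parameter-tuning step that precedes the fitness evaluation can be sketched for the simplest case (a Python illustration; the thesis implementation uses MATLAB and, in chapter 4, Gauss-Newton optimization): when the candidate structure is linear in its single parameter k, the least-squares estimate has a closed form, after which the sum-of-squares error serves as the fitness.

```python
def fit_and_score(basis, xs, ys):
    # candidate model structure: y = k * basis(x)
    # fit the scalar parameter k by linear least squares,
    # then score the tuned model by its sum-of-squares error
    g = [basis(x) for x in xs]
    k = sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g)
    sse = sum((y - k * gi) ** 2 for y, gi in zip(ys, g))
    return k, sse
```

For data generated by y = 2x^2 and the candidate basis x^2, the fit recovers k = 2 with essentially zero error; a structurally wrong candidate would receive a large error and hence a poor fitness.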
Unless the problem is very small and simple enough to be easily solved by blind random search, the individuals in generation zero will have exceedingly poor fitness. Nonetheless, some individuals in the population will turn out to be somewhat fitter than others, and these differences in performance (fitness) are then exploited.
The Darwinian principles of reproduction and survival of the fittest are used to create a new offspring population of computer programs from the current population. The reproduction operation may involve simple replication or sexual reproduction. Replication involves selecting, in proportion to fitness, a computer program from the current population of programs, and allowing it to survive by copying it into the new population.
Figure 2.1 Flowchart of the genetic programming algorithm
The genetic process of sexual reproduction (crossover) between two parent computer programs (selected in proportion to fitness) is used to create new offspring computer programs. The parent programs are typically of different sizes and shapes. The offspring programs are composed of sub-expressions from their parents, and are typically of different sizes and shapes than their parents. Sometimes a parent program simply mutates (by transforming a certain segment of the program) to produce an offspring. After these operations are performed on the current population, the population of offspring (i.e., the new generation) replaces the old population (i.e., the old generation). Each individual in the new population is then measured for fitness, and the process is repeated over many generations. Typically, the best individual that appears in any generation of a run (i.e., the “best-so-far” individual) is designated as the result produced by genetic programming.
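The generational cycle just described — evaluate, select, breed, replace, and track the best-so-far individual — can be condensed into a short sketch (Python for illustration; a toy real-valued individual stands in for a program tree, and all names are hypothetical rather than taken from the DACS-GP code):

```python
import random

def tournament(pop, fitness, k=2):
    # select the fitter of k randomly drawn individuals (lower = better)
    return min(random.sample(pop, k), key=fitness)

def evolve(init, fitness, crossover, mutate,
           pop_size=30, generations=40, p_mut=0.2):
    # minimal generational loop with a running best-so-far individual
    pop = [init() for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            parent = tournament(pop, fitness)
            if random.random() < p_mut:
                child = mutate(parent)
            else:
                child = crossover(parent, tournament(pop, fitness))
            new_pop.append(child)
        pop = new_pop                       # offspring replace the old generation
        best = min(pop + [best], key=fitness)
    return best
```

Plugging in trivial operators (averaging crossover, Gaussian mutation) and a quadratic error as fitness shows the loop homing in on the optimum; in GP proper, the individuals would be trees and the operators the subtree crossover and mutation described below.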
Koza (1992) summarized genetic programming by the following steps:
(1) Generate an initial population of random compositions of the functions and terminals of the problem (computer programs).
(2) Iteratively perform the following sub-steps until the termination criterion has been satisfied:
a. Execute each program in the population and assign it a fitness value according to how well it solves the problem.
b. Create a new population of computer programs by applying the following two primary operations. The operations are applied to computer program(s) in the population chosen with a probability based on fitness.
(i) Copy existing computer programs to the new population.
(ii) Create new computer programs by genetically recombining randomly chosen parts of two existing programs.
(3) The best computer program that appeared in any generation (i.e., the “best-so-far” individual) is designated as the result of genetic programming. This result may be a solution (or an approximate solution) to the problem.

2.3.1 Initializing a GP population
The first step in actually performing a GP run is to initialize the population. The initialization of a tree structure is fairly straightforward: functional genes and terminal genes are selected at random from predetermined sets (the gene library) and assembled into a tree. There are two methods of initializing a tree structure: full and grow (Koza, 2002). The grow method produces trees of irregular shape, because nodes are selected randomly from both the function and the terminal sets. The full method chooses only functions at first and finishes with terminals; the result is that every branch of the tree extends to the maximum depth. Diversity in the initial population is achieved by using both of these methods together, in what is called the “ramped-half-and-half” technique.
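The two initialization methods and their ramped combination can be sketched as follows (a Python illustration with trees as nested lists; the small function and terminal sets are placeholders, not the gene library of the thesis tool):

```python
import random

FUNCS = ['+', '-', '*']           # functional genes, all of arity 2 here
TERMS = ['x', 'u', '1.0']         # terminal genes

def full(depth):
    # full method: only functions until the maximum depth, so every
    # branch of the tree reaches exactly that depth
    if depth == 0:
        return random.choice(TERMS)
    return [random.choice(FUNCS), full(depth - 1), full(depth - 1)]

def grow(depth):
    # grow method: nodes drawn from both sets, giving irregular shapes
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    return [random.choice(FUNCS), grow(depth - 1), grow(depth - 1)]

def ramped_half_and_half(pop_size, max_depth):
    # alternate full and grow over a "ramp" of depths 1..max_depth
    pop = []
    for i in range(pop_size):
        d = 1 + i % max_depth
        pop.append(full(d) if i % 2 == 0 else grow(d))
    return pop
```

Alternating the method while ramping the depth limit is what gives the initial population its mix of shallow, deep, bushy and sparse trees.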
2.3.2 Genetic operator
While the initial population usually has very low fitness, the average fitness invariably increases through the generations. The genetic operators provide new positions in the search space in heuristically better directions; this is similar to the role of the directional derivative in other optimization methods. Among the many varieties of genetic operators, the three common GP operators are: 1) reproduction, 2) mutation and 3) crossover.
Generally, all operations involve: 1) selection of parent(s), 2) selection of operation point(s) and 3) the genetic operation itself. Varieties of selection methods are discussed in a following section. A genetic operation creates offspring that are, on average, more effective than randomly assembled individuals.
2.3.2.2 Crossover
The crossover (sexual recombination) operation for genetic programming creates variation in the population by producing new offspring that consist of parts taken from each parent. The crossover operator combines the genetic material of two parents by swapping a part of one parent with a part of the other. Figure 2.2 shows two parents and the offspring that result from the crossover operation.
The detailed operation of crossover is as follows:
• Choose two candidate parents using a selection method. Generally, the two selections are independent, and incestuous selection may be prohibited.
• Randomly select a crossover point in each parent. The selected crossover points and their subtrees are highlighted with dark lines in Figure 2.2.
• Swap the selected subtrees between the two parents, giving two offspring. The “children” are shown under the parents in Figure 2.2.
Figure 2.2 Crossover operation
2.3.2.3 Mutation
The mutation operation introduces random changes in the genes of one individual in the population. Mutation can be beneficial in reintroducing diversity into a population that may be tending towards premature convergence. Mutation is a relatively unimportant secondary operation in genetic programming practice and is sparingly employed (the probability of mutation in the population is typically set at less than 1%). However, in system identification applications, the probability of mutation should be higher, because the correlation of model structure to chromosome size is small compared to other applications. In this study, the probability of mutation is set in the range of 10% to 30%. Figure 2.3 provides an example of the mutation operation. The details of this genetic operation are:
• Choose a candidate parent using a selection method.
• Randomly select a gene in the parent.
• The selected gene is deleted and substituted with a randomly generated gene. It may be necessary to add or remove some genes when the arities of the deleted gene and the generated gene differ.
Figure 2.3 Mutation operation
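A minimal sketch of subtree mutation (Python, trees as nested lists as above; the 0.3 descent probability and the small gene sets are illustrative choices, not values from the thesis): walk down a random path and replace the node reached with a freshly generated subtree.

```python
import random, copy

TERMS = ['x', 'u', '1.0']
FUNCS = ['+', '-', '*']

def random_tree(depth=2):
    # grow-style generator for the replacement subtree
    if depth == 0 or random.random() < 0.4:
        return random.choice(TERMS)
    return [random.choice(FUNCS), random_tree(depth - 1), random_tree(depth - 1)]

def mutate(tree):
    # with some probability replace this node outright,
    # otherwise descend into a random child; terminals are always replaced
    tree = copy.deepcopy(tree)
    if not isinstance(tree, list) or random.random() < 0.3:
        return random_tree()
    i = random.choice([1, 2])
    tree[i] = mutate(tree[i])
    return tree
```

Replacing a whole subtree (rather than a single symbol) sidesteps the arity bookkeeping mentioned above, since the inserted material is always a complete, valid expression.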
2.3.3 Selection methods
All genetic operators operate on candidate parents chosen by one of several available selection methods. The selection procedure has a significant effect on the speed of evolution and is often cited as the culprit in cases where premature convergence occurs. The common selection methods are described next.
2.3.3.1 Fitness-proportional selection
As the name implies, fitness-proportional (or roulette wheel) selection assigns each individual a probability of being chosen to pass offspring into the next generation in proportion to its fitness. If the fitness of an individual is high, it has a greater chance of being chosen as a parent. For individual i with fitness f_i, this probability p_i is defined as
p_i = f_i / Σ_{j=1}^{n} f_j          (2.1)
Holland (1975) and Koza (1992) used fitness-proportional selection quite heavily. This selection procedure has been criticized in recent times for assigning probabilities based on the absolute values of fitness. For example, the fitness of an individual in a system identification application can be infinite if the sum of squared errors is used as the fitness measure: the individual may represent an “unstable” model giving infinite output values. A large negative value for the fitness can still result in a higher probability p_i being assigned if fitness-proportional selection is employed without proper care. Even if the fitness score is not infinite, large deviations of some individuals can cause problems with this selection method.
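Equation (2.1) translates directly into the classic roulette-wheel sampling loop (a Python sketch; note the comment on the very assumptions whose violation the preceding paragraph warns about):

```python
import random

def roulette_select(pop, fits):
    # fitness-proportional selection per p_i = f_i / sum_j f_j;
    # assumes every fitness is finite and nonnegative -- exactly the
    # assumption that unstable models or negative scores would break
    total = sum(fits)
    r = random.uniform(0.0, total)
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if r <= acc:
            return ind
    return pop[-1]
```

With fitnesses 1 and 9, the second individual is drawn roughly nine times out of ten; a single huge or infinite fitness would monopolize the wheel, which is the degeneracy described above.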
2.3.3.2 Tournament selection
Tournament selection is based not on competition among the full generation but within a subset of the population. A number of individuals, called the tournament size, are selected randomly, and the individual with the best fitness score within the group is singled out. The smallest possible tournament size is two. The tournament size allows the modeler to adjust the selection pressure: a small tournament size causes low selection pressure, and a large tournament size causes high pressure. In this work, tournament selection has been the selection method of choice unless otherwise stated.
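Tournament selection is only a few lines (a Python sketch; fitness is treated as an error to be minimized, matching the system identification setting):

```python
import random

def tournament_select(pop, fitness, k=2):
    # draw k individuals at random; the one with the best (lowest)
    # fitness wins -- k is the tournament size controlling pressure
    return min(random.sample(pop, k), key=fitness)
```

Because only relative ranking within the tournament matters, an individual with a huge or infinite error can never distort the others' selection chances, which is one practical reason for preferring it over fitness-proportional selection; at the extreme k equal to the population size, the global best always wins.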
2.4 Genetic programming theory
The schema theorem of Holland (1975) is one of the most influential and debated theorems in evolutionary algorithms in general and genetic algorithms in particular. The schema theorem addresses the central question of why these algorithms work robustly in such a broad range of domains. Essentially, the schema theorem for fixed-length genetic algorithms states that good schemata (partial building blocks that tend to assist in solving the problem) will tend to multiply exponentially in the population as the genetic search progresses and will thereby be combined with other such schemata into good overall solutions. Thus, it is argued, fixed-length genetic algorithms will devote most of their search to areas of the search space that contain promising partial solutions to the problem at hand.
There have been several attempts to transfer the schema theorem from genetic algorithms to GP (Langdon and Poli, 2002). However, the GP case is much more complex because GP uses representations of varying length and allows genetic material to move from one place to another in the genome. The crucial issue in the schema theorem is the extent to which crossover tends to disrupt or to preserve good schemata. All of the theoretical and empirical analyses of the crossover operator depend, in one way or another, on this balance between disruption and preservation of schemata.
As the applications of GP diversify and are tailored to specific areas, the development of GP becomes less of a scientific and more of an intuitive pursuit. A fundamental working theory is important if one wishes to understand and make definitive improvements to the existing body of knowledge. In this research work, there is no conscious effort to contribute to the theory of GP; the intent is rather to develop a working GP code, aided by other nonparametric data screening methods, to solve problems of interest to chemical engineers.
2.5 Shortcomings and limitations of genetic programming
Because genetic programming works on a large population of solutions, the computational time required is consequently large. In some of the problems we solved, a run took more than three days, which is unacceptable in most cases. Furthermore, there is no guarantee that the solution is the global optimum, so several independent runs are required to verify it. The expensive computation time is an inherent problem of genetic programming and will hopefully be alleviated as the computational power of computers increases.
The success of genetic programming requires a good fitness function that guides the evolution. In some cases, formulating a fitness function is as difficult as solving the problem itself.
With the lack of a strong theory of how genetic programming works, there is no route to definitive improvement. Current improvements to genetic programming are mostly based on one's experience or intuition rather than on a sound theoretical foundation. A lot of fine-tuning goes into solving a particular problem, but such tuning is of limited help in general.
2.6 Summary
This chapter reviewed the concept of genetic programming and explained its basic genetic algorithm. Genetic programming works on a population of individuals, each of which represents a potential solution. A population of model structures, each represented as a tree, evolves through many generations towards a solution using certain evolutionary operators and a 'survival-of-the-fittest' selection scheme. Each essential step in genetic programming (genetic operators, fitness functions and selection methods) was explained in detail. Finally, genetic programming theory was presented and the limitations of genetic programming were pinpointed.
C h a p t e r 3
APPLICATION OF GENETIC PROGRAMING TO
SYSTEM IDENTIFICATION
Never think of knowledge and wisdom as little
Seek it and store it in the mind. Note that ant-hills are built with small particles of dust, and incessantly-falling rain drops, when collected, can fill a big pot.
– Loka Niti
3.1 Introduction
Ljung (1989) describes the procedure for the identification of an appropriate model for a process from experimental data. System identification is an iterative procedure (see Figure 3.1) of model selection from a set of candidate models, guided by prior information and the outcomes of previous modeling attempts. The philosophical similarity between this and the GP methodology is worth noticing. The following discussion compares the standard system identification procedure and the GP algorithm.
In the standard identification procedure, a set of initial candidate models is obtained from a priori knowledge and engineering intuition. In GP, the models are assembled almost randomly (as long as the number of arguments of each function or operator respects the constraints imposed). One can therefore expect the quality of the initial model set, or initial population in GP, to be drastically inferior to that of intuitively proposed models. Prior knowledge (obtained from physical insight or through data preprocessing) can also be incorporated into GP modeling. This is illustrated through the use of nonparametric modeling tools in chapter 5.
[Figure 3.1: a flowchart of the identification loop. Prior knowledge informs the experiment design, the choice of model set and the choice of criterion of fit; data from the experiment are used to calculate the model and its parameters; the model is then validated. If it is OK, it is used; if not, the earlier choices are revised.]
Figure 3.1 The system identification procedure
Once the model structure is determined, the parameters of the model are estimated using an appropriate method (linear or nonlinear regression is usually involved). This is followed by an assessment of model quality, typically based on how the models perform when they attempt to predict newly measured data (Box and Jenkins, 1994). Well-grounded statistical measures such as the Akaike Information Criterion (AIC) (Ljung, 1999), Rissanen's Minimum Description Length (MDL), etc., are employed to discriminate between models. In GP, the fitness measure of individual models is used for this discrimination. The problem with GP models is that the model complexity may vary: we need to compare models of different types (e.g. linear versus nonlinear, or with various types of nonlinearity). In GP, the comparison between models is not straightforward and must be performed very carefully. This issue is pursued in greater detail in section 3.5.
Let us consider how system identification proceeds. The modeler may start with several possible models and see how each performs on a validation data set. Knowing their performance, he/she may discard some of the bad models and examine the remaining models carefully. In the next iteration of the system identification procedure, the modeler exploits the characteristics of the previous set of models to obtain a new set of models that are normally higher in quality than those in the previous iteration. Some of the "new models" may belong to the previous set of models, or be "tweaked" variants of those models, or even be combinations of previously obtained models. This is exactly what happens when a GP-based procedure moves from one generation to the next. In the conventional system identification procedure, the modeler is in a position to consider a variety of fitness measures before converging on a model; the GP program usually has to contend with one fitness criterion (however carefully it may be constructed). Furthermore, an experienced modeler is capable of improving a model in a very effective way (using physical or engineering insights), while a GP program may not be able to achieve this as easily. It must also be very hard for a GP system to generate models without any idea of what a given argument or function could mean to the output. These shortcomings can be at least partially addressed by employing efficient genetic operators, as we demonstrate in the next chapter. In this chapter, our primary focus will be on developing an efficient genetic programming system tailored for system identification tasks.
The representation of algebraic equations is rather straightforward. To represent a differential equation system, we can extend the traditional tree representation. For example, the "code"
plus (z1, times(z1, z2), z3)
plus(z2, sqr(z1))
or the "trees" shown in Figure 3.2 represent the differential equation system

dz1/dt = z1 + z1 z2 + z3
dz2/dt = z2 + z1^2

Figure 3.2 Tree structure of a differential equation model
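The prefix "code" above, plus(z1, times(z1, z2), z3) and plus(z2, sqr(z1)), can be mirrored directly with nested tuples and a small evaluator. This is an illustrative sketch of the tree representation, not the thesis's implementation; the function and variable names are our own.

```python
# Each tree node is a tuple: (operator, child, child, ...); leaves are
# variable names. This mirrors the prefix code plus(z1, times(z1, z2), z3).
def evaluate(node, env):
    if isinstance(node, str):          # leaf: a state variable such as "z1"
        return env[node]
    op, *children = node
    vals = [evaluate(c, env) for c in children]
    if op == "plus":
        return sum(vals)
    if op == "times":
        return vals[0] * vals[1]
    if op == "sqr":
        return vals[0] ** 2
    raise ValueError(f"unknown operator {op!r}")

# Right-hand sides of the two-state system shown in Figure 3.2
dz1 = ("plus", "z1", ("times", "z1", "z2"), "z3")
dz2 = ("plus", "z2", ("sqr", "z1"))
```

Evaluating the two trees at a given state vector yields the right-hand sides of the ODE system, so the same representation serves both algebraic and differential equation models.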
3.3 Parameterization
There are several difficulties with symbolic regression based on genetic programming. The original technique makes use of non-adjustable constants, referred to as ephemeral random constants. These constants do not necessarily assume optimal values, and their use contributes to the complexity of expressions as well as to their inaccuracy. Previous research has proposed methods to improve on ephemeral random constants: Iba et al. (1993) included two adjustable parameters in each expression, and Watson and Parmee (1996) used micro-evolution to evolve constant values within expressions.
In our work, before the fitness value is calculated, the parameters in the GP-assembled model are optimized so that the model prediction is an optimal fit to the measured output. This technique has advantages because (1) an optimal model is obtained, (2) computation time is reduced since GP only has to find the model structure, and (3) proven optimization routines can be employed. Obtaining the globally optimal parameters of any given model structure is a challenging issue and is discussed in chapter 4. Depending on the modeling domain, the optimization procedures are changed, and using a couple of successive optimization routines is not uncommon to ensure a global optimum, or at least our confidence in it. The locations of the parameters to be added are determined by the GP system; however, parameters are generally redundant and it is necessary to prune them. This is achieved by a user-defined evolution policy as described in section 3.6.
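The split between GP (structure) and a conventional optimizer (parameters) can be sketched for the simple case where the proposed structure is linear in its parameters. The structure y = a·u + b·u² + c and the synthetic data below are illustrative assumptions, not a model from this thesis; nonlinear-in-parameter structures need an iterative routine such as the Gauss-Newton method of chapter 4.

```python
import numpy as np

# Suppose GP proposes the structure y = a*u + b*u**2 + c. The structure is
# fixed; only a, b, c are free. Because these parameters enter linearly,
# ordinary least squares recovers them in one shot.
u = np.linspace(0.0, 1.0, 50)
y = 2.0 * u + 0.5 * u**2 + 1.0          # synthetic "measured" output

X = np.column_stack([u, u**2, np.ones_like(u)])
params, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = params
```

Only after this inner parameter fit is the fitness of the candidate structure evaluated, so GP never wastes effort rejecting a good structure merely because its constants were poorly chosen.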
3.4 Improved genetic operators
GA, with its linear representation, traditionally uses the reproduction, crossover and mutation operators exclusively. GP also employs these operators; in addition, its tree-based representation and diverse domain interpretations permit the use of more effective genetic operators.
3.4.1 Superposition crossover
Crossover is a very important genetic operator; Koza (1992) argues that crossover works by imitating natural reproduction and plays the same role in GP as it does in GA. The standard GP crossover operates by randomly selecting a crossover point in each parent tree and then swapping the subtrees attached to those nodes to obtain the offspring. Crossover is an effective genetic operator because a "branch" carries some characteristic feature of the whole chromosome; by regrouping branches, a better chromosome can be obtained. While successful applications using the standard GP crossover have been reported, the limitations of this approach have been identified by several investigators (e.g., Lang, 1995). Several researchers have emphasized that the crossover operator might be improved by adding intelligence to it, letting it select the crossover points in a way that is less destructive to the offspring. This is referred to here as superposition crossover.
In the process of this regrouping by crossover, the properties of a chromosome are broken if the "branch groups" are not selected properly. For example, if two parents share an additive structure, swapping whole addends will probably produce a much better offspring than the offspring a blind crossover operation might generate. We can exploit this idea by specifying that crossover selection points must occur at the plus gene only.
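The standard subtree crossover described above can be sketched on the nested-tuple trees. This is a minimal illustration under our own naming; restricting the candidate paths (e.g. to plus nodes only) would turn it into the superposition variant.

```python
import random

def all_subtree_paths(node, path=()):
    """Enumerate paths (tuples of child indices) to every node in a tree."""
    paths = [path]
    if isinstance(node, tuple):
        for i, child in enumerate(node[1:], start=1):
            paths.extend(all_subtree_paths(child, path + (i,)))
    return paths

def get_subtree(node, path):
    for i in path:
        node = node[i]
    return node

def replace_subtree(node, path, new):
    if not path:
        return new
    i, rest = path[0], path[1:]
    return node[:i] + (replace_subtree(node[i], rest, new),) + node[i + 1:]

def subtree_crossover(parent_a, parent_b):
    """Standard GP crossover: graft a random subtree of one parent onto a
    randomly chosen point of the other."""
    pa = random.choice(all_subtree_paths(parent_a))
    pb = random.choice(all_subtree_paths(parent_b))
    return replace_subtree(parent_a, pa, get_subtree(parent_b, pb))
```

Because trees are immutable tuples here, the parents survive unchanged, which matches the generational bookkeeping GP requires.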
3.4.2 Adaptation
Adaptation is a modified form of mutation. Mutation is a sparsely used genetic operator because of its adverse effects: randomly changing a gene is more likely to turn the chromosome into a worse chromosome than a better one. However, the mutation operation can be used beneficially if the change is introduced not randomly but into a closely related gene, so that the offspring stays close to its parent. This kind of mutation is called adaptation and is employed in this work.
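One plausible reading of "a closely related gene" is a small perturbation of a numeric constant rather than its wholesale replacement. The sketch below illustrates that interpretation only; the Gaussian-step choice and the function name are our assumptions, not the thesis's exact operator.

```python
import random

def adapt(tree, sigma=0.1):
    """Adaptation-style mutation: nudge one numeric constant by a small
    Gaussian step so the offspring stays close to its parent.

    (Illustrative assumption: constants are float leaves, variables are
    string leaves, operators are tuples.)
    """
    if isinstance(tree, float):
        return tree + random.gauss(0.0, sigma)
    if isinstance(tree, tuple):
        i = random.randrange(1, len(tree))   # recurse into one child
        return tree[:i] + (adapt(tree[i], sigma),) + tree[i + 1:]
    return tree  # variable leaf: left unchanged
```

Contrast this with standard mutation, which would replace the chosen node with a freshly generated random subtree and thereby risk destroying the structure evolved so far.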
3.5 Fitness measure
Fitness is a numeric value assigned to each member of the population to provide a measure of the appropriateness of the individual as a solution to the problem in question. The definition of the fitness measure has a direct and significant bearing on the resulting solution. The fitness measure is defined such that the most suited model has the highest (or lowest) score, the poorest model has the lowest (or highest) score, and all other models lie in the continuum between these extremes.
The goal of the fitness evaluation is to give continuous feedback to the evolutionary algorithm about which individuals should have a higher probability of being allowed to multiply and reproduce, and which individuals should have a higher probability of being removed from the population. The fitness function is calculated on the validation data set or on a combination of the training and validation data sets.
We could use well-known model selection criteria from system identification such as the Minimum Description Length (MDL) or Akaike's Information Criterion (AIC). Generally, such criteria include two terms: (a) a term that accounts for the mismatch between the experimental data and the model predictions, and (b) a penalty on the number of parameters. The first term generally decreases with increasing model complexity while the second term increases with it; a good fitness measure balances the tradeoff between the two parts. Iba et al. (1993) used MDL as the fitness measure. MDL is defined as
MDL = 0.5 N log S_N^2 + 0.5 p log N        (3.1)
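Equation (3.1) can be computed directly from the model residuals. In this sketch we read S_N^2 as the mean squared error over the N data points, which is an assumption based on the common form of the MDL criterion used by Iba et al. (1993); the function name is our own.

```python
import math

def mdl_score(residuals, n_params):
    """MDL fitness as in Eq. (3.1): 0.5*N*log(S_N^2) + 0.5*p*log(N).

    Assumes S_N^2 is the mean squared error of the model over the N
    residuals. Lower scores indicate a better accuracy/complexity
    trade-off.
    """
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    return 0.5 * n * math.log(mse) + 0.5 * n_params * math.log(n)
```

Two models with identical residuals but different parameter counts then differ only through the complexity penalty, which is exactly the balance the text describes.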