Application of Computational Intelligence in Biological Sciences

Xu Huan (B.ENG.)
DEPARTMENT OF ELECTRICAL ENGINEERING

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2003
Acknowledgments

I would like to express my deepest gratitude to my supervisors, Dr Arthur Tay of the ECE department and Dr Ng Huck Hui from the Genome Institute of Singapore, for their guidance throughout my M.E. study. Without their gracious encouragement and generous guidance, I would not have been able to finish my work. Their unwavering confidence and patience have aided me tremendously. Their wealth of knowledge and accurate foresight have greatly impressed and benefited me. I am indebted to them for their care and advice not only in my academic research but also in my daily life. I would like to extend special thanks to Dr Dong Zhaoyang of the University of Queensland for his comments, advice, and inspiration.

Special gratitude goes to my friends and colleagues. I would like to express my thanks to Mr Yang Yongsheng, Mr Zhou Hanqing, Mr Ge Pei, Mr Lu Xiang and many others in the Advanced Control Technology Lab; I enjoyed very much the time spent with them. I also thank the National University of Singapore for the research facilities and scholarship.

Finally, this thesis would not have been possible without the support from my family. The encouragement from my parents has been invaluable. My wife, Wang Lei, is the one who deserves my deepest appreciation. I would like to dedicate this thesis to them and hope that they will enjoy it.

Xu Huan
April 2003
Contents

Acknowledgments

1.1 Motivation
1.2 Contribution
1.3 Thesis Organization

2 Evolutionary Computation
2.1 Basic Principle of Evolutionary Computation
2.1.1 Selection
2.1.2 Mutation and Crossover
2.2 Variants of Evolutionary Computation
2.2.1 Evolutionary Strategy
2.2.2 Genetic Algorithm
2.2.3 Evolutionary Programming
2.2.4 Genetic Programming
2.3 Advantage and Disadvantage of Evolutionary Computation
2.4 Constraint Handling
2.5 Premature Convergence Avoidance

3 Finding Probes of Yeast Genome using ES
3.1 Introduction
3.2 Criteria of the probe search
3.2.1 Uniqueness criteria
3.2.2 Melting temperature criteria
3.2.3 Non folding-back criteria
3.3 Evolution strategies, constraints and genetic diversity: the algorithm
3.3.1 Encoding Scheme
3.3.2 Fitness function design and constraint handling
3.3.3 Premature Convergence and Fitness Sharing
3.4 Simulation Results and Discussions
3.5 Conclusions

4 Finding Probes of Human Chromosome 12 using ES and BLAST
4.1 Introduction
4.2 First Exon Prediction
4.3 Local Alignment and BLAST method
4.4 Criteria of Probe search
4.4.1 Uniqueness criteria
4.4.2 Melting temperature criteria
4.4.3 Non folding-back criteria
4.5 Evolutionary Strategies
4.5.1 Encoding Scheme
4.5.2 Fitness function design
4.6 Simulation Results and Discussion
4.7 Conclusion

5 Conclusion
5.1 Main Findings
5.2 Suggestion for Future Work

Author's Publications
List of Figures

3.1 The spread of the uniqueness function, funi
3.2 The spread of the melting temperature function, ftem
3.3 Illustration of non-folding criteria
3.4 The spread of the non-folding back function, fnfb
3.5 Illustration of the incremental penalty function
3.6 Illustration of the incremental penalty function used in probe search
3.7 Comparison of population spread of the sharing (left) and no-sharing (right) methods
3.8 A typical fitness curve for a genome whose probes have been found, without niching method
3.9 A typical fitness curve for a genome whose probes have been found, with niching method
3.10 A typical fitness curve for a genome whose probes have been found, without niching method
3.11 A typical fitness curve for a genome whose probes have been found, with niching method
3.12 The melting temperature of all found probes
3.13 The length of all found probes
3.14 Examples of locations of probes found
4.1 Illustration of DNA transcription
4.2 Sample output of a BLAST test
4.3 Sample of feasible region of the uniqueness criteria (shaded region feasible)
4.4 The feasible region of the melting temperature criteria (shaded region feasible)
4.5 The feasible region of the non-folding criteria (shaded region feasible)
4.6 The length of found probes using enumeration and using ES
4.7 Location of found probes
List of Tables

3.1 Computation time using ES with sharing
3.2 Comparison of number of probes that cannot be found
3.3 Table of ∆S
3.4 Table of ∆H
4.1 ES vs Enumeration
4.2 BLAST vs non-BLAST
4.3 All exons vs entire chromosome
Abstract

DNA microarray is an important tool in genome research. To conduct a DNA microarray test, a set of pre-defined probes is essential. A qualified probe should satisfy three criteria, namely the uniqueness criteria, the melting-temperature criteria and the no self-folding criteria. The traditional method for probe searching is the enumeration method. This method has its own merits, but it is too computationally expensive. Since evolutionary strategy can solve computationally costly problems in a relatively short time, it can be used in searching for probes of a DNA microarray. This thesis is mainly devoted to the development of (i) searching yeast probes using evolutionary strategy, and (ii) searching human probes using evolutionary strategy and BLAST.

In searching yeast probes, the classic evolutionary strategy is modified so that fewer tests are performed on the uniqueness criteria, which need more time than the other two criteria. Adjustments are also made to overcome premature convergence. In searching human probes, the Basic Local Alignment Search Tool (BLAST) is used so that the time spent on the uniqueness criteria test is substantially decreased. The results are compared with the enumeration method to demonstrate the effectiveness of evolutionary strategy in the probe searching problem.
... to find the relation between gene variation and its consequences.
DNA microarray is a revolutionary technology in comparative gene sequence analysis. Unlike traditional methods, which could only deal with two sequences, DNA microarray can monitor the whole genome on a single chip and vastly increases the number of genes that can be studied in a single experiment.

DNA microarray is currently the most widely used tool for large-scale analysis of gene expression and other genomic-level phenomena and patterns. In a microarray, gene-specific patterns (probes) are immobilized on a solid-state support (including glass slides, silicon chips, nylon membranes and plastic sheets) and then queried with nucleic acids from biological samples (targets).
In detail, a DNA microarray experiment is conducted as follows:

1. Nucleic acids (RNA or DNA) that are under research are isolated from biological samples (e.g., blood or tissue).

2. An array of gene-specific probes (DNA microarray) is created or purchased. There are several methods to produce the array. Oligonucleotides (short single-stranded DNA molecules) can be synthesized in situ using photolithographic techniques or phosphoramidite chemistry by ink-jet printing technology (S.P Fodor, 1991; A.C Pease, 1994; S Singh-Gasson, 1999; T.R Hughes, 2001). Alternatively, DNA molecules can be attached to glass slides or nylon membranes (M Schena, 1995).

3. The isolated nucleic acids are converted into labeled targets through one of several methods. Targets can be labeled either with fluorescent dyes that are covalently incorporated into complementary DNA (cDNA) or through radioactivity.

4. The labeled targets are incubated with the solid-state probes, allowing targets to hybridize with probes accurately (A/T, C/G pairing).

5. After incubation, nonhybridized samples are washed away, and measurements are made of the signal (dye or radioactivity) produced during hybridization at particular probe locations. Because the identities of the sequences on the array are typically known, the degree of hybridization at a particular point on the array indicates the level of expression of the gene correlated to that sequence.
The DNA microarray test is widely used in many genomic applications, which makes it an important area of research. The most common applications of DNA microarrays include:

1. Identify point mutations that can be associated with disease.

2. Find genes whose expression is different under pharmacological and pathological conditions.

3. Identify disease subgroups based on their unique gene expression profiles.

4. Predict the function of unknown genes based on the similarity of their gene expression profiles.

5. Find biomolecular pathways that are affected by disease and therapy.

6. Identify prevalent expression patterns and identify DNA sequence patterns.

7. Test drug-treated tissue samples for toxicological effects.

8. Find genes in genome sequences.
As already discussed, a DNA array is an array of gene-specific probes; thus probes are critical in making DNA arrays. In the biological sense, a probe is a molecule having a strong interaction only with a specific target and having a means of being detected following the interaction. Gene-specific probes are nucleic acid probes. They interact with their complement primarily through hydrogen bonding, at tens, hundreds or thousands of sites. The interaction between nucleic acid bases is specific because only a purine-pyrimidine pair can be incorporated into the double helix at the proper H-bonding distance, and only guanine-cytosine or adenine-thymine purine-pyrimidine pairs are suitable pairs. Thus, only G-C and A-T pairs are permitted to form stable probe-target hybridization.

There are generally two kinds of nucleic-acid probes, i.e., biologically amplified (cloned or PCRed) probes and synthetic (oligo) probes. In DNA microarray tests, synthetic oligo probes are used.

Synthetic oligonucleotide probes have several advantages. First, the oligonucleotide probes are short in length; typically their length is less than 100 base pairs (bp). This means a low sequence complexity and low molecular weight, which provides shorter hybridization time. Second, oligonucleotide probe specificity can be tailored to recognize single base changes in the target sequence, since a single-base mismatch in a short probe can greatly decrease the stability of the hybrid. Third, synthetic oligo probes are cost-effective.

Since the probes on the array are synthesized rather than cloned, it is important to know the sequence of the desired probes before they are synthesized.
• Specificity. The most important criterion for a qualified probe is its specificity. Because a probe is used to interact only with its target and not with other RNA, a probe can only be included in one gene, i.e., it should be a unique sub-sequence that appears only in the specific target sequence. This is also known as its uniqueness criteria.

• Sensitivity. The other criterion is sensitivity. Achieving good probe sensitivity needs favorable thermodynamics of probe-target hybridization and the avoidance of unfavorable self-hybridization. The melting temperature can well estimate the thermodynamics of a probe, and a suitable melting temperature is the sign of favorable thermodynamics. This is also called the melting-temperature criteria. To avoid self-hybridization, we need to ensure that the probe does not have a high propensity to form secondary structure, mainly self-folding structure. This is the no self-folding criteria.

The detailed criteria descriptions can be found in Chapter 3 and Chapter 4.

A sub-sequence that meets all these criteria can be a qualified probe. To create a microarray, we need to determine qualified probes for each gene (or exon). Traditionally, a brute-force method is used. Due to the large search space, this method is computationally intensive. For a typical gene with a couple of thousand base pairs, it takes millions of tests to find one probe. This thesis makes an effort to design a new algorithm that can decrease the time spent in probe search while achieving similar search results.
In this thesis, Evolutionary Strategy (ES) is used to solve the probe search problem. Evolutionary strategy is one algorithm belonging to evolutionary computation, a set of stochastic optimization algorithms. A detailed description of evolutionary computation and ES can be found in Chapter 2. Different species have significantly different genome lengths and gene structures, and hence the algorithms to find the probes differ. In conclusion, this thesis has investigated and contributed to the following areas:

A. Finding Yeast Probes

DNA microarray is a powerful tool to measure the level of a mixed population of nucleic acids at one time. In order to distinguish nucleic acids with very similar composition by hybridization, it is necessary to design probes with high specificities, i.e., uniqueness. Yeast is the first eukaryote species whose entire sequence has been found. It has a comparatively simple gene structure and only 10M base pairs, which makes it relatively easy to find the probes using ES. We make use of the available sequence information of all the yeast open reading frames (ORF) combined with an evolutionary strategy to search for unique sequences to represent each and every ORF in the yeast genome. Since the time spent on the three criteria tests differs, the incremental penalty function is used to decrease the number of uniqueness criteria tests, which are the most computationally intensive. The fitness sharing method is used to overcome premature convergence. The probes of 95% of all 6310 genes have been found.

B. Finding Human Probes of Chromosome 12

The human genome is much more complex, with an entire length of 2G base pairs. The genes of humans are not yet accurately determined, so prediction of genes and exons (the coding parts of genes, to be discussed in Chapter 4) is necessary to find probes. The computational time on the uniqueness criteria is long compared to a simple species. BLAST, an algorithm that can determine the uniqueness of all sub-sequences of a gene/exon in a single test, is used to minimize the computational time. The probes of 90% of all predicted exons are found; the results are compared with the results using brute force and discussed.
The thesis is organized as follows. Chapter 2 presents a detailed description of evolutionary computation, the main algorithm we use in the probe search problem. Chapter 3 investigates the yeast probe search; its algorithm is presented and results discussed. Chapter 4 investigates the human chromosome 12 probe search. In Chapter 5, general conclusions and suggestions for future work are given.
2 Evolutionary Computation
After a brief introduction of our research work, we will introduce evolutionary computation as our main algorithm in this chapter. In Section 2.1 we describe the basic principles of evolutionary computation. In Section 2.2 we discuss several variants of evolutionary algorithms. We outline the advantages and disadvantages of evolutionary computation in Section 2.3. In Sections 2.4 and 2.5 we introduce techniques for constraint handling and premature convergence avoidance, respectively; both are important techniques used in the probe finding problem.
2.1 Basic Principle of Evolutionary Computation

Evolutionary computation (EC) represents a powerful search and optimization paradigm. Its underlying metaphor is a biological concept: that of natural selection and genetics.

EC is inspired by the natural process of evolution and makes use of the same terminology. Its peculiarity is to maintain a set of points, called a population, that are searched in parallel. Each point (individual) is evaluated according to an objective function (fitness function). Then, a set of operations is applied to the population. These operations contribute to the two basic principles in evolution: selection and variation. Selection means that the search should focus on a "better" region of the search space, which is achieved by giving an individual with "better" fitness values a higher probability of becoming a member of the next generation. Variation creates some new points in the search space as well as small changes on the points that remain in the next generation. These variation operators include not only random changes on a particular point (mutation) but also the random mixing of the information of two or even more individuals (crossover).
A general EC algorithm works as follows: the population is initialized with a random sample of the search space. Then the generation loop is entered. First, the fitness values are calculated using the objective function. Next, selection is performed using the current population and the current fitness vector. Finally, new points are created from this population using variation and thus form the population of the next generation. This process goes on until some termination criterion is met (e.g., best individual found, no improvement over several generations, scheduled test time reached, etc.). There are also some EC variants that perform mutation first and selection next (e.g., evolutionary strategy).
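To make this loop concrete, here is a minimal Python sketch of the generic generation cycle. It is only an illustration of the procedure just described, not the algorithm used later in this thesis; the population size, mutation scale, number of generations and the toy objective are arbitrary assumptions.

```python
import random

def evolve(fitness, dim, pop_size=30, generations=100, sigma=0.1):
    """Generic EC loop: evaluate, select, vary, repeat."""
    # Initialize the population with a random sample of the search space.
    pop = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate every individual with the objective (fitness) function and rank them.
        ranked = sorted(pop, key=fitness, reverse=True)
        # Selection: keep the better half as parents.
        parents = ranked[: pop_size // 2]
        # Variation: each parent produces one mutated offspring (no crossover here).
        offspring = [[x + random.gauss(0.0, sigma) for x in p] for p in parents]
        pop = parents + offspring
    return max(pop, key=fitness)

# Example usage: maximize a simple concave toy objective.
best = evolve(lambda v: -sum(x * x for x in v), dim=5)
```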
The power of EC as a search technique lies in the fact that it can be characterized as combining features from both path-oriented methods and volume-oriented methods (Back, 1994). EC combines these contrary features: in the initial stage of the search, the population is usually spread out in the search space, corresponding to a volume-oriented search. In a later stage, the search will focus on a few regions due to selection based on fitness values, and these few regions will be examined further. In this respect, the algorithm behaves like a path-oriented search. Another possible identification of these two stages of the search is the correspondence of the first stage to a global reliability strategy and of the second to a local refinement strategy.
possi-2.1.1 Selection
Selection is one of the two important operators in evolutionary computation. It is intended to improve the average quality of the population by giving individuals of higher quality a higher probability of being copied into the next generation. Thereby the search will be focused on promising regions in the search space.
The basic idea of selection is to prefer "better" individuals to "worse" ones, where "better" and "worse" are defined by the fitness function. As only copies of existing individuals are created, more individuals will be located at "good" positions in the search space. This selection, followed by "exploitation", which means that known regions in the search space are examined further, will lead the search in the right direction. The assumption hereby is that better individuals are more likely to produce better offspring, i.e., that there is a correlation between parental fitness and offspring fitness. In population genetics this correlation is named heritability. If this assumption fails, selection of better individuals makes no sense, and hence evolutionary computation will perform no better than random search. Fortunately, most real-world search problems satisfy this assumption and hence can be solved using EC.
A nice feature of the selection mechanism is its independence of the representation of the individual, as only the fitness values of the individuals are taken into account. This simplifies the analysis of the selection methods and allows a comparison that can be used in all kinds of evolutionary computation.

Most selection methods are generational, i.e., they have a generation concept. The selection acts on the whole population, and then the variation operators are applied to the whole population. However, there are also some steady-state selection schemes. The steady-state approach replaces only a few members of the population by applying selection and recombination. For example, one such selection method works as follows: each time, two individuals are selected out of the population and, after crossover, the new offspring is inserted back into the population to replace one parent (Whitley, 1989; Syswerda, 1989).
Some common generational selection methods are listed below:

• Proportional Selection. Proportional selection is the original selection method proposed for the genetic algorithm by Holland (Holland, 1975). The probability of an individual being selected is simply proportional to its fitness value. Obviously, this mechanism only works on fitness maximization problems (i.e., a larger fitness value means a better fitness value), and it assumes all fitness values are greater than zero. One great drawback of this selection mechanism is that it is not translation invariant (Maza and Tidor, 1984). For example, assume a population of 10 individuals with the fitness values f = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10). The selection probability for the best individual is hence pb = 18.2% and for the worst pw = 1.8%. If the fitness function is shifted by 100, i.e., a constant value of 100 is added to every fitness value, we find that p'b = 110/1055 ≈ 10.4% and p'w = 101/1055 ≈ 9.6%, so the selection probabilities become nearly equal and the selection pressure almost vanishes.
• Truncation Selection. In truncation selection with threshold T, only the best T individuals are selected, and they all have the same selection probability. This selection method was introduced into the genetic algorithm by Muhlenbein (Muhlenbein and Voigt, 1995), and it is just the same as the (µ, λ)-selection in evolutionary strategy (Back, 1995).
• Linear Ranking Selection. Linear ranking selection was introduced to eliminate the serious disadvantage of proportionate selection (Whitley, 1989). For linear ranking selection, it is the rank of the fitness value that determines the probability of an individual. Let a population have N individuals. The individuals are sorted according to their fitness values, and rank N is assigned to the best individual and rank 1 to the worst. The selection probability is linearly assigned to the individuals according to their ranks.
Standard generational selection schemes do not guarantee that the current best individual will be contained in the next generation. This may happen either due to the probabilistic nature of a selection scheme or due to the fact that the best individuals are "lost" in mutation. Consequently, elitist selection schemes have been proposed by De Jong (DJong, 1975). They copy the best individual of the current generation to the next generation if no other new individual surpasses it, and thus can ensure that the best individual of the next generation is no worse than that of the current generation.

In our research, truncation selection is used, since our main method is evolutionary strategy. This method avoids the disadvantage of proportional selection and proves to be effective.
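As an illustration, a minimal Python sketch of truncation selection follows; the fitness function and the value of µ in the example are placeholders, not values used in this thesis.

```python
def truncation_select(population, fitness, mu):
    """Keep only the mu best individuals; all survivors are treated equally."""
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[:mu]

# Example: keep the 3 best out of six numbers, using the value itself as fitness.
parents = truncation_select([4, 9, 1, 7, 5, 2], fitness=lambda x: x, mu=3)
# parents == [9, 7, 5]
```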
2.1.2 Mutation and Crossover
The selection operator is employed to focus the search upon the most promising regions of the search space. However, selection alone cannot introduce into a population individuals that do not appear in the intermediate population. Thus, in order to increase population diversity, crossover and mutation are used. As these operators usually create offspring at new positions in the search space, they are called "explorative" operators. The several instances of EC differ in the way individuals are represented and in the realization of crossover/mutation. Common representations include bit strings, vectors of real/integer values, trees, or any problem-dependent data structure.

Along with a particular data structure, variation operators have to be defined; they can be divided into asexual and sexual variation operators. The asexual variation (mutation) consists of a random change of the information represented by an individual.
If the individual is represented as a vector, mutation is the random change of an element of the vector. If the vector is a simple bit-string, as in the case of the classic genetic algorithm, mutation is to toggle a bit. If the vector is real-valued or integer-valued, as in the case of evolutionary strategy, more complicated mutation operators are necessary. The most general approach is to randomly choose one value, defined by a probability distribution over the domain of possible values, to replace the existing one.

The mutation operator for the tree representation works as follows: a randomly chosen mutation site (an edge in the tree) is selected, and the sub-tree attached to this edge is replaced by a new, randomly created tree.
The crossover operator achieves the recombination of the selected individuals by combining information from two selected individuals. Two individuals are chosen from the population and named parents. How the crossover is performed also depends on the chosen representation.

Crossover was originally designed for the bit-string vector representation, and hence several crossover operators are available for bit-string representations (a short sketch follows the list below):
• One-point crossover (Holland, 1975). A crossover point in the vector is randomly chosen, and all elements after this position are swapped, thus forming two new bit-strings, which represent two new individuals.

• Two-point crossover (Syswerda, 1989). Two crossover points are selected randomly from the vector, and all elements between these points are exchanged to make new individuals. This method can also be extended to N-point crossover.

• Uniform crossover (Ackley, 1987). No crossover point is needed in uniform crossover. Instead, for each position of the offspring, the parent which will contribute the value of that position is chosen with a given probability p. For the second offspring, we take the value of the corresponding position from the other parent.
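A minimal Python sketch of two of the bit-string crossover operators listed above follows; it is purely illustrative and assumes parents of equal length.

```python
import random

def one_point(p1, p2):
    """Swap everything after a randomly chosen cut point (one-point crossover)."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform(p1, p2, prob=0.5):
    """Each position of the first child comes from p1 with probability prob;
    the second child takes the value from the other parent."""
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if random.random() < prob:
            c1.append(a)
            c2.append(b)
        else:
            c1.append(b)
            c2.append(a)
    return c1, c2

# Example usage on two 8-bit parents.
child1, child2 = one_point([0] * 8, [1] * 8)
```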
For the tree representation, the crossover operator produces two offspring from two parents in the following way: in each tree, an edge (not necessarily the same edge) is randomly chosen as the crossover site (analogous to the crossover point in a bit-string), and the subtree attached to this edge is cut from the tree, swapped, and combined with the other tree to form the two offspring. Generally, this will result in two new trees even if the two parents are identical.

In our algorithm, the candidate is represented as a vector of integers; thus no crossover is used, only mutation. Details will be discussed in Chapter 3 and Chapter 4.
Considerable attention has been devoted to assessing the relative importance of crossover and mutation, but there are still no generally accepted results. Some researchers (Jones, 1995; Beyer, 1995) found evidence that crossover could be simulated as a macro-mutation.
2.2 Variants of Evolutionary Computation

Evolutionary computation can be classified according to differences in data structures, selection methods and recombination methods. In this section, the main streams in evolutionary computation will be briefly described and their origins indicated. A more detailed discussion of the similarities and differences of the variants of EC can be found in Back's research (Back, 1994).
2.2.1 Evolutionary Strategy
Evolutionary strategy originated in the work of Bienert, Rechenberg and Schwefel (Rechenberg, 1965; Schwefel, 1965; Schwefel, 1975). They initially addressed optimization problems in fluid mechanics and then turned toward general parameter optimization problems.

The natural representation in ES is real-valued or integer-valued vectors as the genes, and hence the selection and variation methods should suit this representation.
Generally, the selection method of ES is truncation selection. The selection method and population concept are defined by two variables, µ and λ: µ gives the number of parents and λ the number of offspring produced every generation. There are two main approaches of ES, denoted by (µ+λ)-ES and (µ, λ)-ES. In the former, µ parents are used to create λ offspring. Then, all parents and offspring together compete for survival, and only the µ individuals with the best fitness values will survive to be the parents of the next generation. In the latter, only the λ offspring compete for survival, and the best µ individuals among them will be the parents of the next generation. All µ parents are completely replaced, that is, the life span of any individual is limited to a single generation. Obviously, the (µ, λ)-ES requires that λ > µ.

No recombination is needed for ES, only mutation. Typically an offspring vector is created by adding a Gaussian random variable with zero mean and a preselected standard deviation to each component of the parent vector.

The idea of making the standard deviation of the mutation a parameter of the parent was introduced in the 1970s (Schwefel, 1981). In this procedure, the perturbation deviation itself is subject to mutation and is thus adapted to the actual topology of the objective function.
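The following Python sketch shows one generation of a (µ, λ)-ES with Gaussian mutation (without the self-adaptive step size). The values of µ, λ and the mutation strength are illustrative assumptions only.

```python
import random

def es_generation(parents, fitness, lam=20, sigma=0.3):
    """One (mu, lambda)-ES step: mutate, then truncation-select among offspring only."""
    mu = len(parents)
    offspring = []
    for _ in range(lam):
        parent = random.choice(parents)
        # Gaussian mutation of every component of the parent vector.
        offspring.append([x + random.gauss(0.0, sigma) for x in parent])
    # Parents are discarded; only the lambda offspring compete for survival.
    offspring.sort(key=fitness, reverse=True)
    return offspring[:mu]

# Example usage: maximize -sum(x^2) starting from 5 random parents of dimension 3.
pop = [[random.uniform(-2, 2) for _ in range(3)] for _ in range(5)]
for _ in range(50):
    pop = es_generation(pop, fitness=lambda v: -sum(x * x for x in v))
```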
2.2.2 Genetic Algorithm
The genetic algorithm (GA) was introduced by Holland and his students at the University of Michigan in the 1970s (Holland, 1975). Essentially, the "original" GA uses a fixed-length bit-string representation, fitness-proportionate selection and one-point crossover.

The typical process of the classical GA is as follows:
1. The problem to be solved is defined and captured by an objective function (fitness function).

2. A population of candidates is initialized, and each individual is coded as a vector termed a chromosome. Holland suggested that representing individuals by binary strings is advantageous (Holland, 1975).

3. For each chromosome, a fitness value is assigned according to the objective function. The fitness value should be positive and is to be maximized.

4. Proportionate selection is used to choose the parents, i.e., the parents are randomly selected out of the population subject to a probability of reproduction, assigned to each chromosome, which is proportionate to its fitness value.

5. From the selected parents, offspring are created using one-point crossover and mutation. The offspring will be the parents of the next generation. Besides one-point crossover, two-point crossover and uniform crossover are also available in GA.

6. The process proceeds to step 3, unless some stopping criterion is satisfied.

Holland suggested using binary bit strings, but this suggestion received considerable criticism (Michalewicz, 1992; Fogel and Ghozeil, 1997). Currently, binary strings are not frequently used, except for problems that are obviously well mapped to a series of Boolean decisions. Fogel and Ghozeil (Fogel and Ghozeil, 1997) proved that there is an essential equivalence between any bijective representations; thus no intrinsic advantage accrues to any particular representation.
The mathematical theory underlying the design of GA is the so-called Schema Theorem (Holland, 1975). It states that a GA works by combining small, good parts of a solution (building blocks) into larger parts by the crossover operator. Another result from this theorem is the use of proportionate selection, which was regarded as having an optimal trade-off between exploration and exploitation. One-point crossover is also suggested by this theorem because it could maintain good building blocks rather than disrupt them. However, in practice, uniform crossover generally provides better solutions with less computational effort (Syswerda, 1989). The relevance of the Schema Theorem is currently unclear, though many successful applications of GA have been published.
2.2.3 Evolutionary Programming

Evolutionary programming (EP), introduced by Fogel, originally evolved finite-state machines that predict a sequence of symbols observed from the environment: the symbols seen so far are fed into the machine. The output is the prediction of the next input and is compared with it. The quality of prediction is measured using a payoff function.

A number of machines is presented as the initial population. The fitness of each machine is calculated. Offspring machines are created through mutation, while no crossover is available. Each parent creates one offspring, and only the best machines among offspring and parents are retained. Typically half the machines are retained to keep the population at a constant size. This process is iterated until an actual prediction of the next symbol (not yet experienced) in the environment is required. If so, the best machine generates this prediction, the new symbol is added to the environment and the process is repeated.

The current state of the art in EP is so-called meta-EP (Fogel, 1991; Fogel, 1992). The selection mechanism is a mixture of tournament selection and truncation selection. The variance of the mutation rate is incorporated in the genotype, thus making self-adaptation (similar to ES) possible.
2.2.4 Genetic Programming
Genetic programming (GP) was introduced to develop computer programs for solving specific problems in an automated way (Koza, 1989; Koza, 1992). However, genetic programming can also be used in other application fields like function optimization, where the shape of a function is evolved, not only the constants. Genetic programming uses a tree-shaped representation. Usually the representations are of variable size. Both recombination and mutation are used as search operators.

The first approaches of GP used proportionate selection. However, currently the preferred selection scheme is tournament selection, which was found to be empirically superior (Koza, 1994).
From the nature of the probe search problem and sample tests, evolutionary strategies prove to be the most suitable method, and hence our algorithm is based on evolutionary strategies (see Chapter 3 and Chapter 4).
2.3 Advantage and Disadvantage of Evolutionary Computation

Evolutionary computation has several advantages:

• Suitable for complex search problems: Complex search problems refer to those problems for which no problem-specific heuristic algorithm exists. In those problems, there are generally high correlations between variables, i.e., the choice of one variable may change the quality of another one. Evolutionary computation has proved to be successful in solving such search problems, though a careful choice among the available EC variants and selection, crossover and mutation methods is very important to achieve good performance.

• Robustness: Though evolutionary computation is in essence a heuristic search method, the performance of EC is not random, i.e., different runs of an EC on the same problem generally give similar results. This is an advantage over other heuristic methods.

• Inherent parallelism: The population concept of EC makes parallelization easy, which means the execution time of EC can be reduced greatly if more computers are used.
Though evolutionary computation has proved to be a good search technique, it still has some weaknesses:

• Heuristic principle: Evolutionary computation is a heuristic search method; this means that EC does not guarantee to find the global optimum within a given number of generations. We still have no theory to predict the accuracy of the result obtained in a limited computation time, i.e., the convergence rate of evolutionary computation is still in doubt for complex search problems.

• Parameter adjustment: Several important parameters, such as the population size, crossover rate and mutation rate, affect the performance of EC. Tuning these parameters is important in constructing a good algorithm. The No Free Lunch theorem (Wolpert and Macready, 1997) proves that any heuristic method is, in general, the same as a random search method. This means that if EC is good at some problems, there will always be some problems on which EC performs worse than random search. It also shows that no single choice of variation, selection, population size and so on can be best in general. So finding a set of good parameters for the problem at hand is always a problem to be solved.

• High computational demand: The modest demands on knowledge of the problem to be solved are paid for with a relatively high computational demand. That is, if a problem-specific algorithm exists, it will generally outperform evolutionary computation, which needs little problem-specific knowledge.
2.4 Constraint Handling

In this section, we will discuss several methods for handling feasible and infeasible solutions in a population. If evolutionary computation is used for constrained optimization problems, it should incorporate the information of constraint violation into the fitness value, because all information about the quality of an individual is determined by its fitness value alone. Currently no universal constraint handling method for evolutionary computation is available; the main approaches are listed here and can also be found in Michalewicz's research (Michalewicz, 1995a; Michalewicz, 1995b).
• Rejection of infeasible individuals: This "death penalty" method is a popular option for constraint handling. This method is really simple and straightforward, and there is no need to evaluate infeasible solutions when using it.

However, generally this method only works well on those problems where the feasible search space is convex and constitutes a reasonable part of the whole search space. If the problem is a highly constrained one, this method performs worse, as most of the time will be spent creating and rejecting individuals. Moreover, for a non-convex feasible region, reaching the optimum by "crossing" the infeasible region is essential but unrealistic when infeasible individuals are rejected.
• Repair of infeasible individuals: In this approach infeasible solutions are transformed into feasible ones with a special repair algorithm. This method is popular in the evolutionary computation community, for it is relatively easy to repair an infeasible individual in many optimization problems.

The weakness of the repairing method is that it is highly problem specific. There is neither a standard repair algorithm nor a standard heuristic to design such repair algorithms. For some problems it is easy to find a repair algorithm; however, for other problems, designing a process to repair infeasible individuals is as complex as solving the original problem.

• Special representations and operators: This method uses specialized representation methods and operators to ensure that all individuals are feasible. An evolutionary computation algorithm using this method often performs better than one using other methods. But the problem is that such special representations and operators may be difficult to find or may not even exist, especially for numerical optimization problems.

• Penalty functions: The most widely used method of constraint handling in evolutionary computation is the use of a penalty function. In this case, the fitness function f'(p) is a combination of the objective function (the previous fitness function) f(p) and the penalty function Q(p), i.e., f'(p) = f(p) + Q(p). The penalty function Q(p) represents either a penalty for an infeasible individual or the cost to repair it. In the case that an individual p is feasible, i.e., no constraints are violated, the penalty function should be zero.
By adding a penalty function, the constrained optimization problem is transformed into an unconstrained optimization problem with a different objective function f'(p). Obviously the optimal point of f'(p) should be in the feasible region of f(p), i.e., it should be the optimal feasible point of f(p).

A problem exists in determining the strength of the penalty. If a high degree of penalty is imposed, more emphasis is placed on obtaining feasibility. The algorithm will move quickly to the feasible region, while it is likely to converge to a point far from the optimum. This is similar to the case of the rejecting-infeasible-individuals method. In contrast, if too low a degree of penalty is used, the algorithm may converge to an infeasible point and also fail to find the optimal feasible point (J.A Jonies, 1994).
To find a good penalty function, the relationship between an infeasible individual p and the feasible region plays quite an important role. This means that an infeasible individual which is quite near the boundary of the feasible region should be given a low penalty compared to those infeasible ones that are far from the feasible region. As Richardson found, "penalties which are functions of the distance from feasibility are better performers than those which are merely functions of the number of violated constraints" (J T Richardson and Hillard, 1989). Furthermore, rank-based selection schemes are proposed to be better than proportionate selection as they avoid scaling problems with the penalty function (D Powell, 1993).
Siedlecki found that "the genetic algorithm with a variable penalty coefficient outperforms the fixed penalty factor algorithm" (W Siedlecki, 1989). Based on this, Michalewicz introduced a dynamic penalty function algorithm (Michalewicz, 1995b). The penalty function Q(p) against constraints g1 to gq takes the form

Q(p) = (1/(2τ)) * sum over j = 1..q of φj(p)^2,

where φj(p) measures the violation of constraint gj, and the parameter τ decreases according to some "cooling scheme" and is called the "temperature".
• Objective switching: Objective switching (M Schoenauer, 1993) first evolves the initial random population with an objective function which is only related to the feasibility of one constraint. If a given percentage of the evolved population fulfills that constraint, the objective function is changed to the next constraint, while individuals that violate the previously handled constraints are given a high penalty. If a reasonable percentage of individuals satisfy all constraints, the objective function is switched to the original fitness function with a rigorous penalty on violation of the constraints. This method will be used in searching yeast probes and demonstrates great effectiveness.
In summary, the penalty function is the most popular method to handle constraints. However, no universal solution is available for constraint handling, and the best constraint handling method is the one that best fits the problem at hand; a small sketch of penalty-based fitness evaluation follows.
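As an illustration of the penalty-function idea, here is a minimal Python sketch. The quadratic penalty form and the weight used here are illustrative assumptions, not the incremental penalty function developed later in this thesis.

```python
def penalized_fitness(p, objective, constraints, weight=10.0):
    """f'(p) = f(p) - Q(p): subtract a penalty proportional to constraint violation.

    Each constraint is written as g(p) <= 0; any positive value of g(p) is a violation.
    Maximization is assumed, so the penalty is subtracted from the objective."""
    violation = sum(max(0.0, g(p)) ** 2 for g in constraints)
    return objective(p) - weight * violation

# Example: maximize f(p) = p subject to p <= 5 (written as g(p) = p - 5 <= 0).
value = penalized_fitness(7.0, objective=lambda p: p, constraints=[lambda p: p - 5.0])
```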
2.5 Premature Convergence Avoidance

Premature convergence is an important concern in evolutionary computation. Though it is more emphasized in the GA community, it is a universal problem also faced by other kinds of EC algorithms. Several methods for overcoming it have been devised and will be briefly discussed in this section.
Premature convergence occurs in complex search spaces, especially in multimodal spaces, i.e., spaces with several or even many peaks (sub-optima) separated by low-fitness areas. Because of the "exploitation" effect of selection operators, a higher percentage of the individuals will gather around the current best individual. As this process goes on, in the case of GA, the population of chromosomes will reach a configuration such that crossover no longer produces offspring that can outperform their parents, as must be the case when all current individuals have converged to the currently found best individual, and hence the global optimum is missed. In the case of ES, the population of individuals will gather around one peak. Since it is separated from other peaks by low-fitness areas, mutation cannot cross the low-fitness areas, no new peak can be found, and hence the global optimum is missed, similar to the case in GA.
Essentially, premature convergence is due to the loss of diversity of chromosomes. Nature solved this problem by forming stable subpopulations of organisms surrounding separate niches, by forcing similar individuals to share the available resources. In evolutionary computation, similar methods can be used. These methods are called niching methods and are listed below:
• Crowding scheme: In the crowding scheme (DJong, 1975), separate niches are produced by replacing existing strings according to their similarity with other strings in an overlapping population. First, two parameters, G and CF, should be determined (De Jong suggests G = 0.1 and CF = 2 or 3). G is the generation gap, which means that only a proportion G of the individuals of the population is permitted to produce offspring in each generation. The method to ensure niching works as follows: when one new individual is produced and one individual must be chosen to die, CF individuals are picked out randomly from the population, and the one which is most similar to the new individual is replaced by the new one.
• Deterministic crowding: The original crowding scheme was modified by Mahfoud and named deterministic crowding (DC) (Mahfoud, 1992; Mahfoud, 1994). DC works as follows. First, all N individuals in the population are divided into N/2 pairs. With crossover and mutation, each pair will yield two offspring. Each offspring will compete with one of its parents for survival, and its "brother" competes with the other parent. There are two possible parent-child competition pairings, and DC chooses the pairing in which the most similar elements compete. By this, it can maintain diversity and create niches in the population.
• Sequential niching: Sequential niching (D Beasley, 1993) is an iteration of simple EC. It uses traditional EC until it converges to one point, records the best individual (one candidate), and then restarts the EC algorithm. To avoid converging to the same area, all the points near the already found candidates are given a low fitness. The authors hope that this method will locate all sub-optima as candidates.
• Fitness sharing: Fitness sharing is inspired by resource sharing in nature (D.E Goldberg, 1987). In nature, if more individuals gather around one place, the resources (food, water) they have will be divided among them, and each gets less than if only one individual were there. In EC, the fitness of an individual is derated by an amount related to the number of similar individuals in the population. The process of fitness sharing is as follows. For a maximization problem, first we need to specify a sharing function, which is a function of the distance between two individuals. The result of the sharing function is related to the distance between the two individuals: the further apart the two individuals, the smaller the function value. It returns 1 if the two individuals are identical, and returns 0 if they cross some threshold of dissimilarity δshare. Then for each individual, we calculate its niche count, which equals the sum of the sharing function between itself and each individual in the population (including itself). Obviously, the least value of the niche count is 1. The shared fitness equals its fitness (given by the objective function) divided by its niche count. The selection is then based on the shared fitness. Clearly, if an individual is crowded, i.e., there are many similar individuals in the population, its niche count is large. It then has a smaller shared fitness and hence less opportunity to be selected and have offspring.
Generally, sharing will yield good performance in multimodal optimization problems, though the construction of the sharing function is critical to achieving good performance (K Deb, 1989; Mahfoud, 1995). In finding yeast probes (Chapter 3), we will use fitness sharing to prevent premature convergence, which greatly increases the accuracy of the search; a small sketch of fitness sharing follows.
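A minimal Python sketch of fitness sharing follows; the triangular sharing function and the threshold used here are common illustrative choices, not necessarily the exact sharing function used in Chapter 3.

```python
def shared_fitness(population, fitness, distance, share_radius=1.0):
    """Divide each raw fitness by its niche count (sum of sharing-function values)."""
    def sharing(d):
        # 1 for identical individuals, 0 beyond the dissimilarity threshold.
        return max(0.0, 1.0 - d / share_radius)

    result = []
    for ind in population:
        niche_count = sum(sharing(distance(ind, other)) for other in population)
        result.append(fitness(ind) / niche_count)  # niche_count >= 1 always
    return result

# Example usage on scalar individuals with absolute-difference distance.
pop = [0.1, 0.12, 0.9]
shared = shared_fitness(pop, fitness=lambda x: 1.0, distance=lambda a, b: abs(a - b))
# The two crowded individuals near 0.1 receive a lower shared fitness than the one at 0.9.
```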
3 Finding Probes of Yeast Genome using ES
3.1 Introduction

DNA microarray, also known as DNA chip, is a revolutionary technology that involves the immobilization of a large number of different DNA molecules within a small confined space (R.J Lipshutz, 1999; D.J Lockhart, 2000). Over the years, several technologies have been developed to attach DNA molecules to solid platforms. Oligonucleotides (short single-stranded DNA molecules) can be synthesized in situ using photolithographic techniques or phosphoramidite chemistry by ink-jet printing technology (S.P Fodor, 1991; A.C Pease, 1994; S Singh-Gasson, 1999; T.R Hughes, 2001). The precision of photolithographic technology allows the synthesis of high-resolution and extremely high-density DNA microarrays. Alternatively, DNA molecules, typically in the form of double-stranded PCR (polymerase chain reaction) products or oligonucleotides, can be attached to glass slides or nylon membranes (M Schena, 1995). The latter method is a more practical and cost-effective avenue for making DNA microarrays in most standard laboratories. In addition, it offers the flexibility of printing DNA of choice onto the solid platform. The main objective of this chapter is to search for these oligos, or the probe set, for the subsequent analysis on the microarray.
For gene expression profiling, ribonucleic acid (RNA) is the subject of measurement with the DNA microarray. The RNA is typically reverse transcribed to give complementary DNA (cDNA), and the DNA is then labeled with fluorescent dye. Upon denaturation of both the immobilized DNA and the labeled cDNA, the mixture is allowed to hybridize. Hybridization is a process in which complementary bases between single-stranded DNA associate together to form stable, double-stranded, anti-parallel DNA via hydrogen bonding. The process of annealing (reassociation) is also highly specific, as cytosine (C) forms the strongest interaction with guanine (G) and adenine (A) with thymine (T). After hybridization, labeled DNA which does not form specific interactions with the immobilized DNA on the microarray can be removed by washing with solvent. Therefore, labeled DNA that is retained on the microarray can be quantitated based on the fluorescence intensity.
The stability of the association between complementary DNA molecules critically depends on the melting temperature (Tm). Tm is operationally defined as the temperature at which 50% of a single-stranded DNA is annealed with its complement to form a perfect duplex. The Tm is governed by several factors: base composition, DNA concentration, salt concentration, and the presence of destabilizing chemical reagents. As a GC base pair is held together by 3 hydrogen bonds while an AT base pair has only 2 hydrogen bonds, a GC-rich sequence has a higher Tm compared to an AT-rich sequence. A higher concentration of DNA favors duplex formation, and consequently the Tm is higher. As cations stabilize DNA duplexes, higher salt concentration raises the Tm. Chemicals such as formamide or DMSO destabilize DNA duplexes and therefore have a negative effect on Tm. In a typical microarray experiment, thousands of DNA spots on the microarray interact with a very complex mixture of labeled DNA under a single condition. Therefore, an optimal hybridization condition is necessary to obtain the best result. One way to attain optimal hybridization is to control the Tm of the immobilized DNA on the microarray.

The yeast Saccharomyces cerevisiae is the first eukaryote whose genome has been sequenced (A Goffeau, 1996; H W Mewes, 1997). Saccharomyces cerevisiae has approximately 6000 genes. The gene structure of this yeast is also relatively simple compared to higher eukaryotes. For example, very few genes contain introns, and most of the open reading frames (ORF), which are protein coding sequences, are preceded by promoters. Since detailed sequence information is known for all predicted genes in this organism, we attempt to design an algorithm to find unique DNA sequences, with optimized Tm, that can be printed onto DNA microarrays.
Our motivation is thus to search for probes within each ORF so that the probes are unique. Due to the large search space and constraints, searching for these probes using traditional search methods is computationally intensive. Our approach is to make use of computational intelligence techniques, in this case evolutionary strategy (ES), in searching for these probes. For this specific problem, some modification of the traditional ES is necessary, namely new constraint handling and premature convergence prevention methods; the details will be discussed in the next few sections. We note that existing methods for finding the probe sets of various genomes are currently only available in private domains involving high commercial value (www.operon.com, 2000). Hence, any new results would be valuable to the public, with new genomes constantly being uncovered.
This chapter is organized as follows. The criteria and specifications of the probe search are given in Section 3.2. Section 3.3 presents the evolutionary strategy used for searching the probe set. Results are presented and discussed in Section 3.4. Conclusions are given in Section 3.5.
3.2 Criteria of the probe search

The basic considerations in designing oligonucleotide probes are specificity and sensitivity. Specificity means that a probe must hybridize primarily with its target, i.e., a probe should avoid cross hybridization. To ensure this, the probe should be a unique sub-sequence that appears only in a specific ORF. The ideal way to determine the specificity of a potential probe would be to check if it appears in other ORFs. Achieving good probe sensitivity requires favorable thermodynamics of probe-target hybridization and avoiding unfavorable self-hybridization. The thermodynamics of probe-target hybridization can be well approximated by calculating the melting temperature, Tm. Since microarrays involve hybridizing many probes simultaneously, there should be uniformity in the thermodynamics of probe hybridization across the chip. Requiring probes to have Tm within a certain range helps to maintain this uniformity. In order to avoid self-hybridization, probes which have a significant propensity to form secondary structure (i.e., probe self folding-back) have to be eliminated. Secondary structures in the probe will act as a barrier to hybridization between the probe and its target. One way to determine the possibility of forming secondary structure is to check whether the probe has long complementary pairs.
In short, there are three criteria essential for a qualified sequence: (1) uniqueness of the sequence; (2) the sequence should have a melting temperature within a specific range; and (3) the sequence should not have a complementary part which could cause folding back of the sequence. A qualified probe/sequence is thus one that satisfies all these three criteria. Next, we define three functions funi, ftem, fnfb to represent the uniqueness, the Tm, and the no folding back criteria, respectively. These three criteria are all true-false criteria, i.e., a probe either satisfies a criterion or not, as illustrated below.
For the ith ORF, let Si represent the whole set of its subsequences. Define funi, ftem, fnfb: Si → {0, 1} such that, for every s belonging to Si,

funi(s) = 1 if s is unique (i.e., does not appear in other genes), and 0 if s is not unique;

ftem(s) = 1 if the melting temperature (Tm) of s is in the desired range, and 0 if it is not;

fnfb(s) = 1 if s has no complementary sequence, and 0 if s has a complementary sequence which will cause folding.

For a qualified probe, which satisfies all three criteria, all three functions equal 1. We can then define a function f to represent whether a sub-sequence is qualified or not as follows:

f(s) = funi(s) * ftem(s) * fnfb(s)

Thus, for any subsequence s, s is qualified if and only if f(s) = 1. The task of finding qualified sequences can be described as finding a set of sequences si in Si which satisfy f(si) = 1 for all i = 1, 2, ..., n, where n is the number of ORFs. We note here that the result of the function f is either 0 or 1. In the next few sections, we will illustrate how to reformulate these functions such that they are suitable for searching the desired probes. The total number of ORFs for yeast is 6310. To illustrate our approach, we will focus our discussion on only one ORF, named Q0010; samples of probes found in other ORFs are presented in Section 3.4.
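As an illustration, the indicator-product form of f(s) can be written directly in code; the three criterion functions passed in below are placeholders for the tests developed in Sections 3.2.1 to 3.2.3.

```python
def is_qualified(s, f_uni, f_tem, f_nfb):
    """f(s) = f_uni(s) * f_tem(s) * f_nfb(s): a probe qualifies only if all three are 1."""
    return f_uni(s) * f_tem(s) * f_nfb(s) == 1

# Example with trivial placeholder criteria (always satisfied).
qualified = is_qualified("ACGT", lambda s: 1, lambda s: 1, lambda s: 1)
```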
3.2.1 Uniqueness criteria
There are two main characteristics of the uniqueness criteria which are critical to the design of the algorithm. First, from simulation, it was found that the computation time of the uniqueness criteria is about 10-100 times more than that of the other two criteria. For example, consider the Q0010 ORF of length 388 bps (base pairs). Arbitrarily choosing two locations from the 388 bps as the starting and ending points, a sub-sequence can be formed. Thus, the total number of possible subsequences is 75078 (i.e., 388*387/2). The entire length of all other ORFs combined is about 9 million bps. To determine whether one sequence appears in this long database is a computationally expensive task: tenths of a second on an HP-UX workstation. Let n be the length of a sequence and m the length of the entire ORF database; the computational time will be O(m log(n)). It is thus unrealistic to test all subsequences of any one ORF, let alone the whole genome. The computational time of the Tm criteria is O(n), and the computational time of the non-folding criteria is O(n^2). Since m is much larger than n, the three criteria need substantially different computational times. To minimize the computational cost, our approach is to test the other two criteria/constraints before testing this one.
Next, it is known that some sequences have similar sub-sequences which may perform the same function (for example, some sequences could encode specific protein domains) (D Higgins, 2000; C Brown and Jacobs, 2000). These common sequences are distributed all over the ORF, making the feasible region discrete and non-linear. Figure 3.1 illustrates the feasible region of Q0010 based on the uniqueness criteria. Notice that the probability that a sub-sequence is unique is not linearly related to its length. Figure 3.1 shows that the sequence (300, 388), of length 89 bps, is not in the feasible region even though it is substantially longer than the average non-feasible sub-sequence (about 20 bps).
[Figure 3.1: The spread of the uniqueness function, funi; the marked area is the feasible region.]
The melting temperature, Tm, of an oligonucleotide refers to the temperature atwhich the oligonucleotide is annealed to 50% of its exact complement As discussedpreviously, the Tm is directly related to the thermodynamics of a probe, and henceits sensitivity For subsequence processing using the microarray, the probes orsub-sequences should have a Tm in the specific range
A number of methods exists for the calculation of Tm, one of the more accurateequations for Tm is the Nearest Neighbor Method (K.J Breslauer and Markey,1986; J Santalucia, 1996):
∆H(GAT C) = ∆H(GA) + ∆H(AT ) + ∆H(T C) The values of ∆H and ∆S can
be found in (K.J Breslauer and Markey, 1986) R is the molar gas constant, C isthe concentration of the probe, [K+] is the salt concentration In searching for thequalified sub-sequence, R is set as 1.987 cal/(oCmol), K+ is set to 50 mmol and
Tm of one sub-sequence (≈ 0.015 sec) is almost negligible compared to that of theuniqueness criteria
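The following Python sketch computes Tm with the nearest-neighbor equation above. The ∆H/∆S table below contains only a few placeholder entries with made-up magnitudes; a real search must use the full tables of Breslauer et al., and the probe concentration C here is likewise an assumed value.

```python
import math

R = 1.987        # molar gas constant, cal/(K*mol)
C = 1e-6         # assumed probe concentration (mol/l); placeholder value
K_PLUS = 0.05    # salt concentration, 50 mmol

# Placeholder nearest-neighbor parameters (kcal/mol and cal/(K*mol)); not real data.
DELTA_H = {"GA": -8.0, "AT": -7.0, "TC": -8.5}
DELTA_S = {"GA": -22.0, "AT": -20.0, "TC": -23.0}

def melting_temperature(seq):
    """Nearest-neighbor Tm: sum dH and dS over adjacent base pairs, then apply the equation."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    dH = sum(DELTA_H[p] for p in pairs) * 1000.0   # convert kcal to cal
    dS = sum(DELTA_S[p] for p in pairs)
    return dH / (dS + R * math.log(C / 4.0)) + 16.6 * math.log10(K_PLUS) - 273.15

tm = melting_temperature("GATC")   # works because all three adjacent pairs are in the toy table
```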
3.2.3 Non folding-back criteria
As discussed above, a qualified sub-sequence should have a low probability of forming secondary structure; otherwise the secondary structure will prevent the hybridization between the probe and its target.

[Figure 3.2: The spread of the melting temperature function, ftem.]

In a probe, if one section of the ORF is the same as the complement of another section in the reverse direction, the two sections form a complementary pair. For example, the sections "A-C-C-G-T-T" and "A-A-C-G-G-T" are a complementary pair (reverse one of them, and the two are complementary according to the base pairing rules A-T and G-C; see Figure 3.3). The longer a complementary pair is, the higher the probability that the probe will fold back to form secondary structure on this complementary pair. As a rule of thumb, the parameter of the non-folding test (specifying the length of a complementary pair) is set to 7, i.e., if a probe has complementary pairs equal to or longer than 7 bps, it is disqualified due to its high probability of forming secondary structure. Notice that the feasible area of the non-folding-back criteria and the feasible area of the uniqueness criteria have only a small common area; in fact, the two criteria are in conflict, which makes the search for a qualified probe difficult. The computational cost is about 0.1 second per test, lower than that of the uniqueness criteria test.
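A straightforward way to implement this test is to ask whether the probe shares a substring of the given length with its own reverse complement. The following Python sketch, with the window length of 7 mentioned above, illustrates the check.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def f_nfb(probe, window=7):
    """Return 1 if no complementary pair of length >= window exists, 0 otherwise."""
    # Reverse complement of the probe; a self-folding stem of length `window`
    # shows up as a common substring between the probe and this string.
    rev_comp = "".join(COMPLEMENT[b] for b in reversed(probe))
    for i in range(len(probe) - window + 1):
        if probe[i:i + window] in rev_comp:
            return 0
    return 1

ok = f_nfb("AAAACCCCAAAACCCCAAAA")  # no T or G partners present, so this returns 1
```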