Genetic and Evolutionary Computation
Rick Riolo • Ekaterina Vladislavleva
Editors
Genetic Programming Theory and Practice X
Foreword by Bill Worzel
and Molecular Biology
The Pennsylvania State University
University Park
Pennsylvania, USA
Ekaterina Vladislavleva
Evolved Analytics Europe BVBA
Beerse, Belgium
Jason H. Moore
Institute for Quantitative Biomedical Sciences
Dartmouth Medical School
Lebanon, New Hampshire, USA
ISSN 1932-0167
ISBN 978-1-4614-6845-5 ISBN 978-1-4614-6846-2 (eBook)
DOI 10.1007/978-1-4614-6846-2
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013937720
© Springer Science+Business Media New York 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
This tenth anniversary edition of GPTP is dedicated to the memory of Jason Daida. Jason’s presentations at the seminal GPTP workshops on structure and reachability inspired and greatly influenced our thinking and guided our research. Although his passion for teaching and education prevented his attendance at recent workshops, it was always a joy to encounter him, be it at a conference or during one of many trips to UM’s sister university in Shanghai. A quick and innovative mind coupled with a ready smile and positive outlook is a tough combination not to cherish. Jason’s many students, friends and colleagues are testimony to his clear vision, dedication to learning and his love of life. We will miss him dearly.
to give my input on a workshop on genetic programming (GP). Carl felt that, as a growing, cutting-edge field, it would be both useful and interesting for PSCS to sponsor a “state of the art” workshop on GP. As we discussed the idea, both Carl and I envisioned a one-time workshop that would bring together people actively working in the field. Little did I know that the workshop would become a crucial part of my life and a regular event in my annual calendar.
At the time GP was still quite a young discipline despite 20 years or more of effort on the part of many researchers. Carl was looking for a unifying theme for the workshop and, after a few minutes of reflection, I suggested a theme of GP theory and practice, where computer scientists studying the theory of GP and practitioners applying GP to real-world problems could meet and discuss their respective progress. It was my thought that such a meeting could provide a review of the current state of theory and that GP programmers could use a better understanding of GP theory to improve the application of GP to “real-world” problems. Conversely, practical results are the ultimate test of theory. Carl was enthusiastic about this idea and, much to my surprise, asked me to work with Rick Riolo to organize the workshop.
Working with Rick was both a pleasure and an education. As I had never been involved in organizing an academic conference or workshop, I let Rick lead the way. Rick and the PSCS staff not only handled the logistics of the conference, but Rick also knew the right questions to ask about format and content. We decided to try to have a matched pairing of theory and practice papers where possible, knowing that this would often be difficult. We also had long discussions about the format of the workshop. It was my idea that we should have longer times for presentations than was normal for conferences as well as plenty of time for discussion. We also decided that at the end of a set of related presentations, we should provide time for discussion reflecting on the set of presentations and what bigger questions they raised. These decisions have proved to be fruitful, as many times the extended discussion sessions have been the most valuable part of the workshop.
Initially we conceived of the workshop as a place where people could present speculative ideas that they might not otherwise talk about at a peer-reviewed conference. Instead, we opted for chapters to be written by presenters that were reviewed by other workshop participants and published in book form. While this meant that all attendees’ submissions would be accepted, they nevertheless went through serious review that often radically changed the chapter, as did the lengthy discussion sessions during the workshop.
Another element we added was a daily keynote. Originally we planned for a generalized topic for a keynote speaker on each day: one day was to be keynoted by someone in evolutionary biology, one by someone in evolutionary computing and one by someone who had expertise in integrating cutting-edge technology into commercial applications. While this strict format has not survived, its spirit has, and over the years the keynotes have spawned many fruitful discussions both during question-and-answer sessions after the keynote and in many discussions that extended late into the evening.
At the end of the first GPTP, it was by no means certain there would be a second workshop. It had been successful, but was not an unalloyed success in terms of content and quality. What was an overwhelming success was the interesting discussions at the workshop and deep into the nights at the end of each day. A little to my surprise, when asked whether a second workshop was in order, the attendees responded with enthusiasm, as did the entities that had provided financial support for the workshop, including the PSCS.
Over the years that have followed, the format has modulated somewhat and PSCS became a Center (CSCS), but the general ideas we settled on that first year (speculative presentations, diverse keynotes, large amounts of discussion time and cross-reviews by participants) have largely stayed intact. Moreover, over time the workshop has developed its own flavor and style that has led people to return; some annually, others biannually and still others only when they had something new to say.
Theory and Practice?
Perhaps the best way to describe the organizing principle of GPTP is the quotation attributed to Jan L. A. van de Snepscheut (and Yogi Berra!): “In theory there is no difference between theory and practice. But in practice there is.” The first thing that quickly became apparent from the early GPTP workshops is that practice always outruns theory, because it is much easier to think up a new scheme that helps to solve a problem than to explain the mathematical reasons why such a scheme improves the fundamental function of the underlying algorithm.
The other thing that emerged was that practitioners became ersatz theorists, developing tools and metrics to test and explain behaviors in GP. Not only did this lead to modifications of existing algorithms and new techniques that were clearly shown to improve outcomes, but it spurred new theoretical consideration of GP. Theorists began to move from work on such fundamentals as the building block hypothesis to broader questions that approached some of the questions the evolutionary biologists wrestle with, such as: What are the constraints on evolution? What are the dynamics? What are the information theoretical underpinnings of GP? There is also a growing sense that researchers in natural and artificial evolution have something to say to each other.
to our improved understanding of GP and others, just because. What follows is the list of my choices from the first 10 years and some brief comments on them. This is by no means an exhaustive list or even a list of the “best” work done (but then evolution favors diversity over optimization), and I hope that people such as Trent McConaghy, Erik Goodman and the many other people that I omitted from the list will not interpret this as lessening my respect for them or their work.
GPTP I: “Three Fundamentals of the Biological Genetic Algorithm” by Steven Freeland
This keynote by the evolutionary biologist Steven Freeland outlined fundamental characteristics of natural evolution that he felt should be adopted by genetic programming. Some of the items he mentions include particulate genes, an adaptive genetic code, and the dichotomy between genotype and phenotype. He also sets a standard for measuring the success of evolutionary computing when he says “Biology will gain when evolutionary programmers place our system within their findings, illustrating the potential for biological inspiration from EC [Evolutionary Computing].”
GPTP II: “The Role of Structure in Problem Solving by Computer” by Jason Daida
This chapter shows that there are natural limits on trees (and perhaps other related structures) that constrain the likely range of program-trees that can be created by standard genetic programming. This raises fundamental questions that have not been fully addressed in subsequent work.
GPTP III: “Trivial Geography” by Spector and Klein.
Spector and Klein showed that by creating a sense of place for individuals in a population and constraining their crossover partners to those in the near neighborhood, a significant improvement in efficiency and effectiveness can be realized. It also implicitly raises the question of an environment for evolution, since once you have a sense of geography you can vary what is found in different locations (i.e., ecosystems).
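The neighborhood-restricted mating described above can be illustrated with a minimal sketch. It assumes a one-dimensional ring of locations, bit-list individuals, two-way tournaments and one-point crossover; these choices, and every function and parameter name below, are illustrative assumptions and are not taken from the chapter itself.

```python
import random


def ring_neighborhood(location, radius, size):
    """Return the locations within `radius` of `location` on a ring of `size` slots."""
    return [(location + offset) % size for offset in range(-radius, radius + 1)]


def tournament(population, fitness, candidates, rng, rounds=2):
    """Sample `rounds` candidate locations and return the fittest individual found."""
    picks = [rng.choice(candidates) for _ in range(rounds)]
    best = max(picks, key=lambda loc: fitness(population[loc]))
    return population[best]


def one_point_crossover(mother, father, rng):
    """Join a prefix of `mother` to the matching suffix of `father`."""
    cut = rng.randrange(1, len(mother))
    return mother[:cut] + father[cut:]


def next_generation(population, fitness, radius, rng):
    """Fill each location with a child of two parents chosen from nearby locations."""
    size = len(population)
    children = []
    for location in range(size):
        nearby = ring_neighborhood(location, radius, size)
        mother = tournament(population, fitness, nearby, rng)
        father = tournament(population, fitness, nearby, rng)
        children.append(one_point_crossover(mother, father, rng))
    return children


# Toy run: 20 bit-list individuals, fitness = number of ones (an assumed toy problem).
rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(20)]
offspring = next_generation(population, fitness=sum, radius=2, rng=rng)
```

With `radius=0` each location can only mate with itself, so the population is copied unchanged; larger radii move the scheme toward ordinary population-wide selection.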
GPTP IV: “Pursuing the Pareto Paradigm: Tournaments, Algorithms and Ordinal Optimization” by Kotanchek, Smits and Vladislavleva.
While the usefulness of Pareto optimization has long been recognized in evolutionary algorithms, this chapter was one of many chapters over many years by the authors that demonstrated that Pareto optimization is a key technique for effective genetic programming. Evolutionary programmers ignore it at their own risk.
GPTP V: “Towards an Information Theoretic Framework for Genetic Programming” by Card and Mohan.
This is the beginning of a long and arduous journey by Stu Card and his associates to provide a model of genetic programming built on information theory. Now reaching its final, most general state, this may be the most important piece of theoretical work in the GP world yet. As a small joke, I once mentioned to Stu that since Lee Smolin proposed in his book The Life of the Cosmos that our universe evolved from earlier universes, Stu’s work would be The Theory of Everything.
GPTP VI: “A Population Based Study of Evolution” by Almal, MacLean and Worzel
This study done by my team imaged the dynamic changes of a GP population and demonstrated behaviors similar to those of natural populations, suggesting that GP behavior is closer to natural evolution than had previously been thought.
GPTP VII: “Graph Structured Program Evolution: Evolution of Loop Structures” by Shirakawa and Nagao.
I believe that using graph structures may lead to more powerful forms of GP and, as an explicit structure-altering technique, may overcome some of the limitations outlined by Daida in The Role of Structure in Problem Solving by Computer. While this chapter is fairly limited in its results, its method is powerful.
GPTP VIII: “Genetic Programming of Finite Algebras” by Spector et al.
This is not actually a chapter to be found in a GPTP book, being presented instead at GECCO in 2008, but Lee Spector presented this informally at GPTP-2009. It is an important paper in that he showed that GP was able to prove algebraic theorems that were too complex for human solution.
GPTP IX: “Novelty Search and the Problem With Objective Functions” by Lehman and Stanley
This chapter is noteworthy, if for no other reason, because it calls into question the use of objective functions focused on accomplishing a specific result (even including the case of multi-objective functions). Instead it suggests that the search for novelty in GP-derived programs may be more important, arguing that there is evidence in nature that novelty is more important than some hypothetical optimum. Moreover, it reinforces the argument that a more complex environment may yield better results.
GPTP X: “A Practical Platform for On-Line Genetic Programming for Robotics” by Soule and Heckendorn.
This was presented at GPTP-2012 by Terry Soule and will appear in the book you are holding (or reading online). It was built on a simple premise: Terry’s group at the University of Idaho wanted to have a simple, easily programmable robot as a testbed for using GP in robotics. After looking at commercially available options for research robots, Terry concluded that there needed to be a less expensive, yet powerful and easily upgradeable platform as a testbed. They settled on a platform built from a number of off-the-shelf (OTS) components, with the computer being a smart phone. I include this both because I think it is an important tool for the GP community and because of the cleverness of how they assembled the components to make an inexpensive but powerful robot.
Thoughts on the Future of GP
Finally, as is typical after a review of the past, I want to take a guess at the future of GP, in the form of suggestions of desirable paths to be taken. As Alan Kay once said: “The best way to predict the future is to create it.”
The GP community has a powerful opportunity to create the future, as the continued growth of GP and its applications seems likely as the volume of data generated in all disciplines continues to grow. Methods such as GP, that can take data and turn it into information, will be of increasing importance. Of course, my suggestions of how we should approach the future are predictably biased by my experience and taste, so buyer beware!
The first area that seems ripe for further work is the growing collaboration between biologists and the GP community. Evolutionary biologists and evolutionary computer scientists not only share an interest in understanding the complexity of natural and computational evolution, but they also share a goal of building better models of complex processes. Some items where GP can build toward biology harken back to Steve Freeland’s keynote in 2003 where he recommended implementing a particulate gene model, diploidal chromosome structures and building more complex ecologies. All of these have been tried at one time or another in the history of GP, but I believe the time is right to produce a focused effort to build systems that integrate all of these elements.
On the flip side, deeper collaborations between the GP community and evolutionary theorists seem likely because of the growing use of computer models by biologists in all areas. The GP community can help in developing models by creating empirical models from biological data that can provide insight into first principles models that produce the data. Moreover, GP tools can be used to image entire populations and model the dynamics of evolution.
The second area that I view as a rich area of exploration for the GP community is the question of what algorithms match the timescales of the systems being modeled and the possibility that GP could integrate different algorithms effectively. The point here is that in nature, evolution works on one timescale, ecology another and biology yet another. In machine learning techniques, neural nets work quickly, once they are trained. Artificial immune systems work on a longer timescale, responding somewhat more flexibly, and evolutionary algorithms work on another timescale. I suspect that effectively integrating these different techniques may depend on recognizing the timescale on which they are most effective. It may also be possible to evolve an integrated solution using evolutionary algorithms to select component algorithms to solve larger computational problems with timescales as the constraint. I think this is particularly likely to be valuable in robotics, simulations and games (where many innovations first find a commercial home).
Finally, I would like to call for GP to be applied to even more complex problems than has been the case to date. As our computing resources have continued to grow, and our improvement of fundamental algorithms and tools has progressed, it may be possible to address more difficult problems. Some areas may include symbolic proofs, complex problems such as the n-body problem or ecological models. The history of GPTP suggests that we may be at the point of pushing GP into more adventurous applications.
My view of the future of GP may be summed up by the following question: If the Singularity arrives, will it be by design or evolution?
And in Conclusion
Some years ago, at one of the earliest GPTP workshops, Rick Riolo described GP as “an art struggling to become a craft.” It is safe to say that with the modern tools and improved understanding of the GP mechanisms that have been generated in the last 10 years, it is at least a craft, and is beginning to be closer to an engineering discipline than ever before.
While it would be a gross exaggeration to say that this occurred because of the GPTP workshop, it is at least fair to say that GPTP has had a role in bringing together some of the best and most creative evolutionary engineers and theorists on an annual basis in a comfortable environment for 3 days of intense discussion, questions and speculation. I hope that the field will continue to mature and that the Genetic Programming Theory and Practice Workshop will continue as long as it continues to be useful.
In conclusion, I would like to thank GPTP’s supporters for their generosity and, in particular, The University of Michigan and the Center for the Study of Complex Systems. Rick Riolo’s role as midwife at GPTP’s birth and his quiet, steady role as parent for its growth are very much appreciated by all of the attendees over the years. Thanks Rick!
The work described in this book was first presented at the Tenth Workshop on Genetic Programming, Theory and Practice, organized by the Center for the Study of Complex Systems at the University of Michigan, Ann Arbor, May 12–14, 2012. The goal of this workshop series is to promote the exchange of research results and ideas between those who focus on Genetic Programming (GP) theory and those who focus on the application of GP to various real-world problems. In order to facilitate these interactions, the number of talks and participants was small and the time for discussion was large. Further, participants were asked to review each other’s chapters before the workshop. Those reviewer comments, as well as discussion at the workshop, are reflected in the chapters presented in this book. Additional information about the workshop, addendums to chapters, and a site for continuing discussions by participants and by others can be found at http://cscs.umich.edu/gptp-workshops/.
The rest of this preface consists of two parts: (1) a brief summary of both the formal talks and of the informal talk during the scheduled and unscheduled discussions; and (2) acknowledgements of the many generous people and institutions who made the GPTP-2012 workshop possible by their financial and other support.
A Brief Summary of the Ideas from Talks and Talked About Ideas at GPTP-2012
As in the previous 10 springs, the 2012 workshop on Genetic Programming in Theory and Practice (GPTP) was hosted by the Center for the Study of Complex Systems of the University of Michigan. The discussions at the tenth jubilee gathering were particularly cohesive and friendly and nevertheless constructive, creative, and deep. Hoping to repeat the success of the GPTP workshops of the previous years, we planned lots of time for discussions and made the workshop longer. In 2012 it ran from Thursday morning till Saturday afternoon. Debates were full of open self-reflection, critical progress review and committed collaboration.
Thanks to our generous sponsors we could invite three keynote speakers this year and open every day of the workshop with an insightful and inspiring story. Thursday started with an address by Sean Luke on “Multiagent Systems and Learning.” Professor Luke, from the Department of Computer Science at George Mason University, has been an influential researcher in the fields of genetic programming and multiagent systems. His insight and experience in these areas contributed greatly to the workshop discussions about how to use genetic programming to solve complex problems. Friday began with a talk by Professor Seth Chandler on “Evolving Binary decision trees that sound like law.” Chandler, professor of Law at the University of Houston, gave a remarkable and enlightening talk on applications of genetic programming in law. Not only did Professor Chandler show how to use genetic programming to evolve boolean expressions that predict the outcomes of legal cases and therefore sound like true “law”, he also provided a revealing comparison of GP-generated models with conventional approaches like decision trees, SVMs and NNs. His insightful illustrations of advantages of GP in terms of model compactness, transparency and interpretability, as well as the unanticipated application area, inspired many important discussions during and after the workshop. Saturday opened with Bill Worzel, then Chief Technology Officer of Everist Genomics, providing an “unkeynote address,” “A Random Walk through GP(TP).” As Bill said, “an unkeynote speaker will deliver a mostly retrospective talk, reflecting on what has happened, and perhaps a bit on why it has happened; call it the historian’s view.” Bill was present at and instrumental in the creation of GPTP, and his perspective on the highlights from the past 10 years was educational, insightful, and entertaining.
Fifteen chapters were presented this year by newcomers and natives of GPTP on new and improved general purpose GP systems, analysis of problem and GP algorithm complexity, new variation paradigms, massively distributed GP, symbolic regression benchmarks, model analysis workflows and many exciting applications. The practice of GP was presented this year in a wide range of areas: robotics, image processing, bioinformatics and cancer prognostics, games, control algorithms design, stock trading, life sciences, and insurance law.
An important change this year compared with previous workshops was a more varied mix of different representations of GP individuals in presented systems. We made a coordinated effort to expand the topics of practical applications of GP far beyond GP symbolic regression for data fitting, and we think we achieved success. Important topics in general purpose GP were the focus of many papers this year:
• Evolutionary constraints, relaxation of selection mechanisms, diversity preservation strategies, flexing fitness evaluation, evolution in dynamic environments, multi-objective and multi-modal selection (Spector, Chap. 1; Moore, Chap. 7; Hodjat, Chap. 5; Korns, Chap. 9; Kotanchek, Chap. 13);
• Evolution in dynamic environments (Soule, Chap. 2; Hodjat, Chap. 5);
• Foundations of evolvability (see Moore (Chap. 7) for co-evolution of variation operators, Giacobini (Chap. 4) for adaptive and self-adaptive mutation, Korns (Chap. 9), Flasch (Chap. 11) for parameter optimization);
• Foundations of injecting expert knowledge in evolutionary search (see Moore, Chap. 7; Benbassat and Sipper, Chap. 12; Hemberg, Chap. 15; Harding, Chap. 3);
• Analysis of problem difficulty and required GP algorithm complexity (Flasch (Chap. 11), albeit with empirical validation for symbolic regression); and
• Foundations in running GP on the cloud: communication, cooperation, flexible implementation, and ensemble methods (Hodjat, Chap. 5; Wagy, Chap. 6; McDermott, Chap. 14).
While GP symbolic regression was concerned with the same challenges as above, the additional focal points were:
• The need to guarantee convergence to solutions in the function discovery mode (Korns, Chap. 9);
• Issues on model validation (Castillo, Chap. 10);
• The need for model analysis workflows for insight generation based on generated GP solutions: model exploration, visualization, variable selection, dimensionality analysis (Moore, Chap. 7; Kotanchek, Chap. 13);
• Issues in combining different types of data (Ritchie, Chap. 8).
Another positive observation is that the existential discussions on whether GP can declare success as a science have dissipated from GPTP. The overall consensus is that GP has found its niche as a capacious and flexible scientific discipline, attracting funding and students, and demonstrating measurable successes in business. Four companies using GP-based technology as their competitive advantage were represented among GPTP-2012 participants: Genetics Squared (cancer prognostics), Genetic Finance (stock trading), Evolved Analytics (plant and research analytics), and Machine Intelligence (image processing).
It looks like the focus has shifted from being satisfied with generating beneficial comparisons of GP with other disciplines (e.g. GP symbolic regression with machine learning, see “do we have a machine learning envy?” in GPTP-2010) towards a more productive search for high-impact problems solvable with GP in various yet-to-be-conquered application areas, and massive popularization of GP.
An increasing gap between theory and practice of GP undoubtedly remains an issue. We doubt that this gap will ever be closed. Theoretical analysis of GP search performance is impossible without heavy constraints on the application area, representation, genotype-phenotype mapping, initialization, selection and variation mechanisms. First results were obtained last year for two simple problems (Neuman et al., 2011). The main challenge here is to make the analyzed problems as realistic as possible. The fact that all GP practitioners are aware of the countless small and big hacks that have made their GP algorithms considerably more effective adds to the staggering complexity of theoretical analysis of GP search. At this point in time the search for tight bounds on computational complexity for real problems seems intractable. We believe that attracting as many hobbyists and interdisciplinary scientists as possible to the GP discipline, coupling research with other disciplines like fundamental computer science, mathematics, and systems biology, and a more systematic approach to GP can help bridge the gap between theory and practice.
Last year we stated that “symbolic regression and automated programming are just the two ends of a continuum of problems relevant for genetic programming: Symbolic Regression > Evolution of executable variable length structures > Automatic Programming. And while the ‘simplest’ application of GP to data fitting is well studied and reasonably understood, more effort must be put into problems where a solution is a computer program” (Vladislavleva et al., 2011). In response to this quest, GPTP-2012 presented systems where a GP individual was an SQL query (Spector, Chap. 1), an image filter (Harding, Chap. 3), a power control algorithm (Hemberg, Chap. 15), a game board evaluation function (Sipper, Chap. 12), a legal-case decision outcome (Chandler¹), a stock-trading rule-set (Hodjat, Chap. 5), a robot micro-controller (Soule, Chap. 2), and a gene-expression classifier (Moore, Chap. 7). Such variety of representation could be an indication that we are slowly but steadily moving along the “Symbolic Regression > Evolution of executable variable length structures > Automatic Programming” path in the right direction.
We hope to solicit more work on evolving executable, variable length structures in future workshops and facilitate understanding of missing mechanisms for using GP for automatic programming. GP shines in problems in which no single optimal solution is desired but rather a large set of alternative and competing local optima. Effective exploration of these optima in dynamic environments is perhaps the biggest strength of GP.
The idea to keep in mind is that many complex problems are multi-modal, and to solve them with GP we must relax selection mechanisms. How to do selection in a complicated dynamic environment where we never get enough information was probably one of the most popular questions at GPTP-2012.
• Thomas Helmuth and Lee Spector (Chap. 1) suggested that evolving programs with tags is one of the most expressive and evolvable ways to evolve modular programs, because tag matching implies inexact naming of individuals, and hence, more flexible selection.
• Soule (Chap. 2) addressed the problem of evolving cooperation and communication of robots online. He suggested that a hierarchical approach seems to be crucial for real-time learning at various time scales, and hierarchy is a form of niching. His chapter on designing inexpensive research robots to test on-board real-time evolutionary approaches has also contributed to another important goal addressed by many speakers at GPTP-2012: popularization of GP in other application areas, in this case robotics.
• Hodjat and Shahrzad (Chap. 5) proposed an age-varying fitness estimation function for distributed GP for problems where exact fitness estimation is unattainable, e.g. for building reliable stock trading strategies at long time scales.
• Harding et al. (Chap. 3) considered a flexible developmental representation, CGP, to evolve impressive filters for object tracking in video using only a limited set of training cases.
¹ From an unpublished keynote address made at GPTP X (2012).
• Wagy et al. (Chap. 6) presented a flexible distributed GP system incorporating many relaxations to evaluation and selection mechanisms, e.g. data binning and island models.
• Moore et al. (Chap. 7) employed multi-objective Pareto-based selection with fitness and model size as objectives in the computational evolution system for open-ended analysis of complex genetic diseases.
• Wagy et al. (Chap. 6) use an archive layering strategy as a means to maintain diversity in a massive scale GP system, EC-Star. Evolution here also takes a form of niching² where individuals are layered by a MasterFitness criterion, a kind of fidelity measure, reflecting the proportion of fitness cases against which individuals have been evaluated already.
• Korns (Chap. 9) presented complexity-accuracy selection niched per model age as a baseline GP symbolic regression algorithm.
• Flasch and Bartz-Beielstein (Chap. 11) provided empirical analysis of single-objective and relaxed multi-objective selection for problems of increased complexity and demonstrated once again the undeniable advantages of niching per complexity and age for more effective search in GP symbolic regression.
• Kotanchek et al. (Chap. 13) called for using as many competing objectives as possible, and varying them during the evolutionary search. The authors hypothesized that niching-based selection is the number one resolution for diversity preservation and effective exploration of complicated search spaces in dynamic environments.
When considering dynamic environments, inexact selection is directly related to issues of evolvability and open-ended evolution. The latter was addressed directly in several ways this year:
• Giacobini et al. (Chap. 4) introduced adaptive and self-adaptive mutation based on Lévy flights as a flexible variation operator. Self-adaptive mutation is especially applicable to problems where the length of the evolutionary search is unknown upfront, and it is impossible to hardcode an optimal balance between exploration at the beginning of the search and exploitation towards the end. It seems that flexibly scaled, massively distributed GP might benefit dramatically from the proposed self-adaptive mutation paradigm.
• Moore et al. (Chap. 7) have been facilitating evolvability and open-ended evolution by designed injection of expert knowledge into the evolutionary search.
• Benbassat et al. (Chap. 12) analyzed the same strategy of injecting domain knowledge for effective evolution of GP-based game players, albeit with (naturally) less conclusive results. They discovered that for some games domain knowledge injection was definitely advantageous while for others it was not, illustrating the trade-off between flexibility (little domain knowledge) and specialization (a lot of domain-specific knowledge).
2 By niching everywhere in the chapter we mean speciation leading to independent selection without any fitness modifications like in fitness sharing.
Another topic related to evolvability is the application of GP to problems with very different data sources. Ritchie et al. (Chap. 8) explored the problems with meta-dimensional analysis of phenotypes using the Analysis Tool for Heritable and Environmental Network Associations (ATHENA). The authors pled for solving issues with data integration in disease heritability research: the need for methods handling multiple data sources, multiple data types, and multiple data sets.
We were glad to witness once again the collaborative spirit of GPTP. Many open questions of GPTP-2011 were addressed this year. For example, the need for distributed evolution was answered in three chapters on GP system design targeted at massive distribution on a cloud (from 1,000 to 700,000 nodes) and generated a lot of debate. The island population model was considered to be one of the key strategies for flexible distributed evolution. However, McDermott et al. (Chap. 14) showed that the classical island model is not optimal for running GP on the cloud due to the lack of elasticity and robustness. The chapter raises insightful questions on the design of flexible evolution and provides initial experimental results comparing distributed and non-distributed design, flexible centralized vs. decentralized vs. hard-coded, and static vs. dynamic population structure. Perhaps the most intriguing and arguably most applicable to elastic computation is the decentralized dynamic heterogeneous GP design, where population islands may differ in selection criteria, training data, and GP primitives, the number of nodes can increase or decrease dynamically, and the system is robust toward communication failures between nodes.
Another design for a massive-scale distributed GP system employing a hub-and-spoke network topology is the EC-Star GP system presented by Wagy et al. (Chap. 6). The system is characterized by massive distribution capacity over come-and-go volunteer nodes, its robustness, its scalability, and its particular applicability to time series problems with an extremely high number of fitness cases (e.g. in stock trading) when combined with the age-fitness evaluation described in Hodjat et al. (Chap. 5).
From the general questions raised during discussions at GPTP-2012 we would like to distinguish the following:
• What are the problems where the solution is a computer program? How to steer GP towards evolving programs?
• Can an algorithm evolved by GP learn during its execution?
• How to overcome the inherent problem of search space non-smoothness which emerges from the combination of representation and genetic operators? How to change the representations and variation mechanisms to allow minor adaptations? Is it necessary?
• How to optimally exploit and expand the concept of simple geographies?
• Maybe we should populate environments with subsets of training data?
• Should we pursue efficient strategies for parameter tuning or develop self-adaptive parameter settings?
• How to strike a balance between exploration and exploitation in open-ended
evolution?
• How to seamlessly integrate different types of data structures?
• If the goal of many problems we are attempting to solve is understanding of the underlying process, what are innovative post-processing methods for analysis and final selection of GP solutions?
• Are diversity preservation and niching and expert knowledge sufficient for
open-ended evolution?
• When solution accuracy is the goal, how to build self-correcting systems with built-in quality assurance?
• How to exploit modern architectures to run GP?
• How to characterize problems where either static or dynamic, centralized or decentralized, homogeneous or heterogeneous island models are beneficial for distributed GP?
• How many runs are enough to compare various algorithm setups?
• How to make hierarchical behavior in multi agent systems emerge rather than
hard-code it?
• How to learn in general without too much reinforcement?
• How to enable supervised learning with very few training examples?
• How to do selection in environments where we never have enough information?
• What unites all methodologies we use for flexing the fitness evaluation and
selection strategies?
• How to facilitate cultural propagation of GP to other disciplines? What is the strategy for bringing what we do to people who can benefit from it but do not know about it?
We are grateful to all sponsors and acknowledge the importance of their contributions to such an intellectually productive and regular event. The workshop is generously founded and sponsored by the University of Michigan Center for the Study of Complex Systems (CSCS) and receives further funding from the following people and organizations: Michael Korns, John Koza of Third Millennium, Babak Hodjat of Genetic Finance LLC, Mark Kotanchek of Evolved Analytics, and Jason Moore of the Computational Genetics Laboratory of Dartmouth College.
We would like to thank all participants for another wonderful workshop. We believe GPTP does bring a systematic approach to understanding and advancing GP in theory and practice, and we look forward to GPTP-2013.
Acknowledgments
We thank all the workshop participants for making the workshop an exciting and productive 3 days. In particular we thank the authors, without whose hard work and creative talents neither the workshop nor the book would be possible. We also thank our three keynote speakers, Sean Luke, Seth Chandler and Bill Worzel.
The workshop received support from these sources:
• The Center for the Study of Complex Systems (CSCS);
• John Koza, Third Millennium Venture Capital Limited;
• Michael Korns;
• Mark Kotanchek, Evolved Analytics;
• Jason Moore, Computational Genetics Laboratory at Dartmouth College;
• Babak Hodjat and Genetic Finance LLC
We thank all of our sponsors for their kind and generous support for the workshop and GP research in general.
A number of people made key contributions to running the workshop and assisting the attendees while they were in Ann Arbor. Foremost among them was Susan Carpenter, who made GPTP X run smoothly with her diligent efforts before, during and after the workshop itself. After the workshop, many people provided invaluable assistance in producing this book. Special thanks go to Kadie Sanford, who did a wonderful job working with the authors, editors and publishers to get the book completed despite the many obstacles, small and large. Courtney Clark and Melissa Fearon from Springer provided invaluable advice and editorial efforts, from the initial plans for the book through its final publication. Thanks also to Springer’s LaTeX support team for helping with various technical publishing issues.
References
Vladislavleva et al. (2011) Genetic Programming Theory and Practice IX. Springer, 2011.
Neumann, O’Reilly, and Wagner (2011) “Computational Complexity Analysis of Genetic Programming - Initial Results and Future Directions”, Genetic Programming Theory and Practice IX. Springer, 2011.
1 Evolving SQL Queries from Examples with Developmental Genetic Programming
Thomas Helmuth and Lee Spector
2 A Practical Platform for On-Line Genetic Programming for Robotics
Terence Soule and Robert B Heckendorn
3 Cartesian Genetic Programming for Image Processing
Simon Harding, Jürgen Leitner, and Jürgen Schmidhuber
4 A New Mutation Paradigm for Genetic Programming
Christian Darabos, Mario Giacobini, Ting Hu, and Jason H Moore
5 Introducing an Age-Varying Fitness Estimation Function
Babak Hodjat and Hormoz Shahrzad
6 EC-Star: A Massive-Scale, Hub and Spoke, Distributed Genetic Programming System
Una-May O’Reilly, Mark Wagy, and Babak Hodjat
7 Genetic Analysis of Prostate Cancer Using Computational Evolution, Pareto-Optimization and Post-processing
Jason H Moore, Douglas P Hill, Arvis Sulovari, and La Creis Kidd
8 Meta-Dimensional Analysis of Phenotypes Using the Analysis Tool for Heritable and Environmental Network Associations (ATHENA): Challenges with Building Large Networks
Marylyn D Ritchie, Emily R Holzinger, Scott M Dudek, Alex T Frase, Prabhakar Chalise, and Brooke Fridley
9 A Baseline Symbolic Regression Algorithm
Michael F Korns
10 Symbolic Regression Model Comparison Approach Using Transmitted Variation
Flor A Castillo, Carlos M Villa, and Arthur K Kordon
11 A Framework for the Empirical Analysis of Genetic Programming System Performance
Oliver Flasch and Thomas Bartz-Beielstein
12 More or Less? Two Approaches to Evolving Game-Playing Strategies
Amit Benbassat, Achiya Elyasaf, and Moshe Sipper
13 Symbolic Regression Is Not Enough: It Takes a Village to Raise a Model
Mark E Kotanchek, Ekaterina Vladislavleva, and Guido Smits
14 FlexGP.py: Prototyping Flexibly-Scaled, Flexibly-Factored Genetic Programming for the Cloud
James McDermott, Kalyan Veeramachaneni, and Una-May O’Reilly
15 Representing Communication and Learning in Femtocell Pilot Power Control Algorithms
Erik Hemberg, Lester Ho, Michael O’Neill, and Holger Claussen
Index
Thomas Bartz-Beielstein is Head of the CIplus Research Center and Professor of
Computer Science at Cologne University of Applied Sciences, Germany
e-mail:thomas.bartz-beielstein@fh-koeln.de
Amit Benbassat is a graduate student in the Computer Science Department at
Ben-Gurion University, Israel, e-mail:amitbenb@cs.bgu.ac.il
Flor A Castillo is a Scientist in the Performance Materials R&D group of the Dow
Chemical Company, e-mail:facastillo@dow.com
Prabhakar Chalise is a Research Assistant Professor at the University of Kansas
Medical Center, USA, e-mail:pchalise@kumc.edu
Holger Claussen is head of the Autonomous Networks and Systems Research
Department at Bell Labs, Alcatel-Lucent in Dublin, Ireland
Christian Darabos is a postdoctoral research fellow in the Computational Genetics
Laboratory of the Geisel School of Medicine at Dartmouth College, USA, e-mail:
christian.darabos@dartmouth.edu
Scott M Dudek is a software developer in the Center for Systems Genomics at the
Pennsylvania State University, USA, e-mail:sud23@psu.edu
Achiya Elyasaf is a Ph.D student in the Computer Science Department at
Ben-Gurion University of the Negev, Israel, e-mail:achiya@cs.bgu.ac.il
Oliver Flasch is a Ph.D student in the Computer Science Department at Cologne
University of Applied Sciences, Germany, e-mail:oliver.flasch@fh-koeln.de
Alex T Frase is a software developer in the Center for Systems Genomics at the
Pennsylvania State University, USA, e-mail:alex.frase@psu.edu
Brooke Fridley is an Associate Professor at the University of Kansas Medical
Center, USA, e-mail:bfridley@kumc.edu
Mario Giacobini is leader of the Computational Epidemiology Group of the
Department of Veterinary Sciences and of the Complex Unit of the Molecular Biotechnology Center of the University of Torino, Italy
e-mail:mario.giacobini@unito.it
Simon Harding founded Machine Intelligence Ltd to solve industrial applications
using Genetic Programming, and previously was a researcher at the Dalle Molle Institute for Artificial Intelligence (IDSIA), e-mail:simon@idsia.ch, simon@machineintelligence.co.uk
Robert B Heckendorn is an Associate Professor in the Computer Science
Department at the University of Idaho, USA, e-mail:heckendo@uidaho.edu
Thomas Helmuth is a graduate student in the Computer Science Department at the
University of Massachusetts, Amherst, MA, USA, e-mail:thelmuth@cs.umass.edu
Erik Hemberg is a post-doctoral researcher in the Natural Computing Research
and Applications group, University College Dublin, Ireland
Douglas P Hill is a software engineer in the Institute for Quantitative Biomedical
Sciences at Dartmouth Medical School, USA, e-mail:douglas.hill@Dartmouth.edu
Lester Ho is a research engineer in the Autonomous Networks and Systems
Research Department at Bell Labs, Alcatel-Lucent in Dublin, Ireland
Babak Hodjat is co-founder and chief scientist at Genetic Finance LLC, in San
Francisco, CA, USA, e-mail:babak@geneticfinance.net
Emily R Holzinger is a graduate student in the Human Genetics Program at
Vanderbilt University, USA, e-mail:emily.r.holzinger@vanderbilt.edu
Ting Hu is a postdoctoral researcher at the Geisel School of Medicine, Dartmouth
College, USA, e-mail:ting.hu@dartmouth.edu
La Creis Kidd is an Associate Professor of Pharmacology and Toxicology at the
University of Louisville, USA, e-mail:lrkidd01@louisville.edu
Arthur K Kordon is Advanced Analytics Leader in the Advanced Analytics
Group within the Dow Business Services of The Dow Chemical Company
e-mail:akordon@dow.com
Michael F Korns is Chief Technology Officer at Freeman Investment Management,
Henderson, Nevada, USA, e-mail:mkorns@korns.com
Mark E Kotanchek is a CEO and Founder of Evolved Analytics LLC
e-mail:mark@evolved-analytics.com
Jürgen Leitner is a PhD candidate in robotics and machine learning at the Dalle
Molle Institute for Artificial Intelligence (IDSIA) and the Università della Svizzera Italiana (USI), e-mail:juxi@idsia.ch
James McDermott is a Research Fellow in the Evolutionary Design and
Optimization group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA, e-mail:jmmcd@csail.mit.edu
Jason H Moore is the Third Century Professor of Genetics and Director of the
Institute for Quantitative Biomedical Sciences at Dartmouth Medical School, USA, e-mail:Jason.H.Moore@Dartmouth.edu
Michael O’Neill is Director of the Natural Computing Research and Applications
group, University College Dublin, Ireland, e-mail:m.oneill@ucd.ie
Una-May O’Reilly is leader of the Evolutionary Design and Optimization
Group and Principal Research Scientist at the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, USA, e-mail:unamay@csail.mit.edu
Rick Riolo is Director of the Computer Lab and Research Professor in the Center
for the Study of Complex Systems at the University of Michigan, USA
e-mail:rlriolo@umich.edu
Marylyn D Ritchie is an Associate Professor of Biochemistry and Molecular
Biology at the Pennsylvania State University, USA
e-mail:marylyn.ritchie@psu.edu
Jürgen Schmidhuber is Director of the Swiss Artificial Intelligence Lab IDSIA,
Professor of Artificial Intelligence at the University of Lugano, Switzerland, and Professor at SUPSI, e-mail:juergen@idsia.ch
Hormoz Shahrzad is principal researcher and platform architect at Genetic
Finance LLC, in San Francisco, CA, USA, e-mail:hormoz@geneticfinance.net
Moshe Sipper is a Professor of Computer Science at Ben-Gurion University,
Israel, e-mail:sipper@cs.bgu.ac.il
Guido Smits is a Principal Research Scientist at Dow Benelux BV in the
Netherlands
Terence Soule is an Associate Professor in the Computer Science Department at
the University of Idaho, USA, e-mail:tsoule@cs.uidaho.edu
Lee Spector is a Professor of Computer Science in the School of Cognitive Science
at Hampshire College, Amherst, MA, USA, e-mail:lspector@hampshire.edu
Arvis Sulovari is a graduate of Dartmouth College and research assistant in the
Computational Genetics Laboratory at Dartmouth
Kalyan Veeramachaneni is a post-doctoral associate at the Computer Science
and Artificial Intelligence Laboratory, MIT. He received his Ph.D. in Electrical Engineering from Syracuse University in December, 2009.
Carlos M Villa is a Senior Research Scientist in the Reaction Engineering group
of Engineering and Process Sciences of The Dow Chemical Company
e-mail:cmvilla@dow.com
Ekaterina Vladislavleva is a Chief Data Scientist and Partner at Evolved
Analytics, U.S., Managing Director at Evolved Analytics Europe, Belgium and part-time associate member at CIplus Research Center at Cologne University of Applied Sciences, Germany, e-mail:katya@evolved-analytics.com
Mark Wagy is a software engineer in the Evolutionary Design and Optimization
group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology and will be a graduate student at the University of Vermont
in the fall, e-mail:mark.wagy@gmail.com
Chapter 1
Evolving SQL Queries from Examples
with Developmental Genetic Programming
Thomas Helmuth and Lee Spector
Abstract Large databases are becoming ever more ubiquitous, as are the opportunities for discovering useful knowledge within them. Evolutionary computation methods such as genetic programming have previously been applied to several aspects of the problem of discovering knowledge in databases. The more specific task of producing human-comprehensible SQL queries has several potential applications but has thus far been explored only to a limited extent. In this chapter we show how developmental genetic programming can automatically generate SQL queries from sets of positive and negative examples. We show that a developmental genetic programming system can produce queries that are reasonably accurate while excelling in human comprehensibility relative to the well-known C5.0 decision tree generation system.
Key words: Data mining, Classification, SQL, Push, PushGP
Introduction
In the emerging era of “big data,” vast amounts of data are available in many kinds of databases. Unfortunately, many users who have access to this data are unable to use it effectively because they do not know how to extract relevant, concise and comprehensible features or summaries of the data; that is, they do not know what queries to formulate in order to discover novel and useful aspects of the data. This issue can be addressed in part by a system that takes positive and negative example tuples, which are generally easy for users to provide, and returns concise, comprehensible SQL queries that classify the provided tuples in simple and potentially interesting ways.
The creation of queries from examples can be thought of as a data mining classification problem, which is often one task within a larger “knowledge discovery in databases” process (Freitas 2002). In this task the objective is to create a comprehensible and interesting query that correctly classifies the given examples. In many cases we have no reason to expect there to be a simple query that perfectly classifies the examples, but we would nonetheless like to create a reasonably simple query that both does a good job at classifying the examples and is concise enough to be easily interpreted by the user.
To make the general problem more concrete, we seek a system that takes as inputs a database $D$ and training example tuples $E = E^+ \cup E^-$, where $E \subseteq D$ and $E^+ \cap E^- = \emptyset$. Here, $E^+$ is the set of positive examples, and $E^-$ is the set of negative examples. The goal of the system is to discover a concise and potentially interesting query $Q$ such that $E^+ \subseteq Q(D)$ and $E^- \cap Q(D) = \emptyset$.
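As an illustration only, the two conditions on the query can be checked directly with set operations. The sketch below is not part of QFE; the toy table, the example sets, and the names satisfies_goal, older_than_37, and everyone are hypothetical, and a query is modeled simply as a function from a set of tuples to a subset of it.

```python
def satisfies_goal(query, database, positives, negatives):
    """Check the two conditions from the text:
    every positive example is returned, and no negative example is returned."""
    result = set(query(database))
    contains_all_positives = positives <= result       # E+ is a subset of Q(D)
    excludes_all_negatives = not (negatives & result)   # E- and Q(D) are disjoint
    return contains_all_positives and excludes_all_negatives


# Hypothetical toy database D of (name, age) tuples.
D = {("ann", 25), ("bob", 41), ("cat", 52), ("dan", 33)}
E_pos = {("bob", 41), ("cat", 52)}   # positive examples
E_neg = {("ann", 25)}                # negative examples


def older_than_37(db):
    """Stand-in for a query with the condition (age > 37)."""
    return {row for row in db if row[1] > 37}


def everyone(db):
    """Stand-in for a query with no condition at all."""
    return set(db)


print(satisfies_goal(older_than_37, D, E_pos, E_neg))  # True
print(satisfies_goal(everyone, D, E_pos, E_neg))       # False
```

Note that the tuple ("dan", 33) is in neither example set, so the conditions place no constraint on whether a query returns it.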
We have developed a system called Query From Examples (QFE) that takes the set of examples $E$ and searches for a query $Q$ that satisfies the above properties. It does this by means of developmental genetic programming. In QFE, each program $P$ creates (or “develops”) a query $Q_P$ that is then evaluated on how well it correctly classifies the given example tuples $E$.
In contrast to other approaches to the production of database queries with GP (see below), this form of developmental GP allows QFE to use standard program representations and genetic operators, along with standard population and evolutionary control parameters. The only change required to use this approach in conjunction with most GP systems is to include new developmental functions in the system’s function set. The developmental approach makes it easy to implement systems like QFE on top of existing GP systems and thereby to take advantage of advances in the general state of the art of GP. In addition, it may make it easier to evolve queries of arbitrary structure, thereby enhancing the generality of the system for a wide range of applications.
In the work described in this chapter we ran QFE on a standard data mining classification task and compared its results to those given by the decision tree classifier C5.0. We find that although QFE does not produce quite as accurate a classifier as C5.0, the classifier that it produces is more concise and comprehensible than the one produced by C5.0. We therefore believe that developmental GP is competitive with, and in some ways superior to, other modern data mining systems on the creation of classifiers.
The remainder of the chapter is structured as follows. The next section describes work that others have done evolving SQL queries. Section “Evolving Queries from Examples” describes our QFE system and its implementation. Our experiments and results are given in Sects. “Experimental Design” and “Results”. Finally, we discuss limitations of QFE, possible improvements to QFE (including generalizations that QFE makes possible but that competing approaches would not), and our general conclusions.
Related Work
A variety of research has been conducted that uses GP either for the creation of queries (da Silva and Thomas 2010; Acar and Motro 2005) or for data mining (Freitas 2002, 1997; Ishida and Pozo 2002; Doucette et al. 2012; Veeramachaneni et al. 2012, among many others). Because this literature is quite voluminous and varied we will comment specifically only on those systems most closely related to QFE.
Castro da Silva and Thomas (2010) directly evolve queries as individuals with the goal of generating queries for inexperienced SQL users. In order to ensure that evolved queries are syntactically correct they implement numerous non-standard genetic operators to combine and mutate individuals. This approach requires significant re-design of any existing GP system and, we would argue, limits the system’s generality. Interestingly, this system seems to be the only prior work in which queries are allowed to include joins across tables, leaving the joining attribute up to evolution.
Acar and Motro (2005) frame their work as trying to provide an alternative equivalent query to a given query by creating the alternative using the results of the original as positive examples. Although their motivation is different from ours, the resulting system has many similarities. Their method assumes that the sets of positive and negative examples cover the entire database instead of a small subset of it. The user must provide the entire set of example tuples that are in the database, which is probably impossible without using an a priori query to fetch them. The given system evolves actual queries as individuals, but can only handle queries expressible as trees of relational algebra expressions.
Freitas describes a GP system that evolves programs that can be interpreted as SQL queries to be used in the data mining tasks of classification and generalized rule induction (Freitas 1997). Individuals are represented as trees that directly correspond to WHERE clauses of queries. Unlike QFE, this work allows for the evolution of non-binary classifiers via niches that correspond to classes of the goal attribute. This paper was pioneering insofar as it introduced the idea of evolving SQL queries but it presents no experiments or results, and it does not make clear how one can deal with practical issues such as the choosing of constants, the design of an appropriate fitness function, the alterations that must be made to standard genetic operators, etc. Because there are no results one cannot judge the system with respect to query accuracy, comprehensibility, conciseness, and time. Additionally, this approach is limited (unlike the developmental approach that we present below) to the production of queries over a single table with WHERE clauses that can be expressed as trees. Freitas has continued to produce a great deal of significant related work but not, so far as we are aware, additional work on the use of GP to evolve SQL queries.
GPSQL is a data mining system that uses grammar genetic programming (different from grammatical evolution) to create SQL queries for classification (Ishida and Pozo 2002). This GP system uses individuals that are composed of grammar-based derivation trees, where the grammar underlying each tree allows for problem-specific SQL queries to be formed. The use of derivation trees allows genetic operators to replace a node in a tree only with a node that is generated using the same production rule from the grammar, meaning that the resulting children must be syntactically correct. In this respect the system is somewhat like strongly-typed GP (Montana 1995) with a very large number of types, one for each production rule. Unfortunately, this means that each problem requires an extensive BNF grammar to be defined by the user. The BNF grammars described in this work appear to be highly specialized to specific problems, defining how each condition is formed and with what values an attribute may be compared.
Evolving Queries from Examples
We used the PushGP genetic programming system to evolve individuals that create queries. Each individual is a program that manipulates a state that can be interpreted as a SQL query after the program terminates. This type of GP system, in which individuals create executable structures via state manipulation, and in which the resulting structures are subsequently executed (e.g. as database queries) to produce desired outputs, is known as developmental GP (Gruau 1994; Koza et al. 1999). We assign fitnesses to individuals based on how many of the positive and negative example data points the queries that they produce return when run on the database.
Push and PushGP
PushGP is in many respects a generic GP system except that its individuals are represented in the Push programming language (Spector 2001; Spector et al. 2005). Push is a stack-based language in which instructions fetch arguments from stacks and return results to stacks; each type has its own stack. In Push, programs consist of nested lists of intermingled instructions and literals. Strongly-typed instructions are able to either retrieve the correctly typed arguments if they are available, or act as “no-ops” (and do nothing) if they are not. Push has been implemented in many languages; this work uses the Clojure implementation, which may be freely downloaded at the Push project page.¹
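The stack discipline described above can be sketched in a few lines. This is a deliberately simplified toy, not the Clojure Push implementation: it assumes a flat program (no nested lists, no exec or code stacks), only two stacks, and two hypothetical instruction names, integer_add and string_length, chosen for illustration. It shows literals being routed to the stack of their type and an instruction acting as a no-op when its arguments are missing.

```python
def run_push(program):
    """Run a flat list of literals and instruction names on typed stacks.

    Simplifying assumption: a string that matches an instruction name is
    treated as an instruction, any other string is a string literal.
    """
    stacks = {"integer": [], "string": []}

    def integer_add():
        ints = stacks["integer"]
        if len(ints) < 2:        # arguments unavailable: act as a no-op
            return
        b = ints.pop()
        a = ints.pop()
        ints.append(a + b)

    def string_length():
        if not stacks["string"]:  # arguments unavailable: act as a no-op
            return
        # Assumed semantics: consume the string, push its length as an integer.
        stacks["integer"].append(len(stacks["string"].pop()))

    instructions = {"integer_add": integer_add, "string_length": string_length}

    for item in program:
        if isinstance(item, str) and item in instructions:
            instructions[item]()
        elif isinstance(item, int):
            stacks["integer"].append(item)   # integer literal
        elif isinstance(item, str):
            stacks["string"].append(item)    # string literal
    return stacks


# The first integer_add finds an empty integer stack and does nothing.
final = run_push(["integer_add", 3, 4, "integer_add",
                  "hello", "string_length", "integer_add"])
print(final)  # {'integer': [12], 'string': []}
```

Because a missing argument never raises an error, any sequence of instructions and literals is a valid program, which is what lets standard genetic operators be applied without syntactic repair.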
Push allows for many different types to be used within one program, each of which has its own stack. Common types such as integers, floats, strings, and booleans are often used, as are “code” and “exec” types that allow for the evolution of self-modifying programs and novel control structures. Additional problem-specific types can be added when necessary. For evolving queries, we have added stacks for the SELECT, FROM, and WHERE clauses of an SQL query, although we primarily use the “where” stack along with the standard integer and string stacks. Table 1.1 contains the instructions used in the runs reported below.
Table 1.1: Instructions used in our PushGP runs

Stack     Instructions
integer   add, sub, mult, div, mod, stackdepth, dup, swap, rot
string    length, stackdepth
where     condition_from_stack, condition_from_index, condition_distinct_from_index, condition_from_pos_ex, condition_from_neg_ex, and, or

1 http://hampshire.edu/lspector/push.html

Programs must also contain literals of the data types that they use. When the Push interpreter encounters a literal within a program it simply pushes it onto the stack of the appropriate type. For the evolution of queries we only use literals that come from ephemeral random constants (ERCs), which are random number or string generators that produce constant literals when they are selected for inclusion in new code. For integer literals, we include two ERCs: one that produces integers uniformly in the range [0, 100,000), and one that uses more of a logarithmic scale, in that it chooses a range uniformly from [0, 10), [0, 100), [0, 1,000), [0, 10,000), and [0, 100,000) and then chooses a constant uniformly from within the chosen range. The logarithmic ERC makes small integers, which may be important for use in WHERE clause conditions, more common than with the ERC over the entire range [0, 100,000). Additionally, we include a string ERC that produces strings between 1 and 10 characters long that may include any uppercase or lowercase letters as well as any numerical digits.
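The three ERCs described above can be sketched as follows. The function names are hypothetical, and the sketch assumes ASCII letters and that string lengths are drawn uniformly from 1 to 10, which the text does not state explicitly.

```python
import random
import string

# Upper bounds of the ranges [0, 10), [0, 100), ..., [0, 100,000).
LOG_RANGE_BOUNDS = [10, 100, 1_000, 10_000, 100_000]
# Assumption: "letters" means the ASCII letters a-z and A-Z.
ALPHANUMERIC = string.ascii_letters + string.digits


def uniform_int_erc(rng):
    """Integer drawn uniformly from [0, 100,000)."""
    return rng.randrange(0, 100_000)


def log_int_erc(rng):
    """Pick one of the five ranges uniformly, then an integer uniformly in it."""
    upper = rng.choice(LOG_RANGE_BOUNDS)
    return rng.randrange(0, upper)


def string_erc(rng):
    """Alphanumeric string of 1 to 10 characters (length assumed uniform)."""
    length = rng.randint(1, 10)  # randint includes both endpoints
    return "".join(rng.choice(ALPHANUMERIC) for _ in range(length))


rng = random.Random(0)
uniform_sample = [uniform_int_erc(rng) for _ in range(2000)]
log_sample = [log_int_erc(rng) for _ in range(2000)]

# Small integers should be far more frequent under the logarithmic ERC.
small_uniform = sum(1 for x in uniform_sample if x < 10)
small_log = sum(1 for x in log_sample if x < 10)
print(small_uniform, small_log)
print(string_erc(rng))
```

Under these assumptions the logarithmic ERC returns a value below 10 with probability of roughly 0.22, compared with 0.0001 for the uniform ERC.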
Developmental GP
As described above, QFE creates an SQL WHERE clause by manipulating a state through developmental instructions. The state is kept and manipulated on the where stack of the Push interpreter state. The where stack can have any number of items pushed onto it, where each item is either a single condition on one attribute or any number of conditions joined by the logical operators AND, OR, and NOT. Each condition may be over any attribute of the table and is constructed as described below. One example of a possible item on the where stack is (age > 37). To create an SQL query from a Push program, QFE runs the program, takes the top item on the where stack, and uses it as the WHERE clause of the query. For our experiments, the SELECT clause always just has “*”, and the FROM clause references the only table in the database. By adding new instructions, we could generalize QFE so that it could evolve queries over a database with multiple tables by evolving the SELECT and FROM clauses, as discussed in Sect. “Conclusions and Future Work”.
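The final assembly step can be illustrated with Python's built-in sqlite3 module. This is a sketch under several assumptions: the table people, its contents, and the helper build_query are hypothetical; the where stack is modeled as a list whose last element is the top; and returning an unconditioned query when the stack is empty is a choice made here, since the text does not say what QFE does in that case.

```python
import sqlite3


def build_query(table, where_stack):
    """Use the top item of the where stack as the WHERE clause of SELECT *."""
    if not where_stack:
        # Assumption: with an empty where stack, fall back to no condition.
        return f"SELECT * FROM {table}"
    return f"SELECT * FROM {table} WHERE {where_stack[-1]}"


# Hypothetical single-table database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("ann", 25), ("bob", 41), ("cat", 52), ("dan", 33)],
)

# State left behind by some program; only the top (last) item is used.
where_stack = ["(name = 'dan')", "((age > 37) AND (age < 50))"]
query = build_query("people", where_stack)
rows = conn.execute(query).fetchall()

# Count how many positive and negative examples the query returns.
positives = {("bob", 41), ("cat", 52)}
negatives = {("ann", 25)}
returned = set(rows)
pos_returned = len(positives & returned)
neg_returned = len(negatives & returned)

print(query)
print(rows)
print(pos_returned, neg_returned)
```

The two counts at the end are the raw quantities that a fitness function of the kind mentioned earlier could be built from; how they are combined is not specified in this section, so no formula is assumed here.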
Trang 33The instructions that create and connect conditions are given as the where stackinstructions in Table 1.1 The instructions where condition from stack,
these instructions creates a condition by using three literals off of the integer and possibly string stacks. where_condition_from_stack first pops an integer off of the integer stack and uses it as an index into the attributes of the table, taken modulo the number of attributes in the table. A second integer is popped and taken modulo 6 as an index to decide which comparator will be used, from the set {=, <, >, <=, >=, <>}. Finally, a value is popped off of whichever stack is of the same type as the chosen attribute and is compared to that attribute to create the condition. The condition composed of the chosen attribute, comparator, and constant is pushed onto the where stack. It should be noted that if this instruction or any of the others do not find the number of arguments they require on the stacks, they act as no-ops.
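The behavior of where_condition_from_stack can be sketched in a few lines of Python. This is an illustrative reconstruction, not the QFE implementation: the dictionary-based state, the convention that the last list element is the top of a stack, and the two-attribute schema in the usage lines are all assumptions made for the example.

```python
COMPARATORS = ["=", "<", ">", "<=", ">=", "<>"]


def where_condition_from_stack(state, attributes):
    """Push one condition onto state["where"], or do nothing (no-op).

    state: dict of lists named "integer", "string" and "where"; the last
           element of each list is treated as the top (an assumption).
    attributes: list of (name, type) pairs, type "integer" or "string".
    """
    ints = state["integer"]
    if len(ints) < 2:
        return  # not enough arguments: act as a no-op
    name, kind = attributes[ints[-1] % len(attributes)]
    comparator = COMPARATORS[ints[-2] % len(COMPARATORS)]
    if kind == "integer":
        if len(ints) < 3:
            return  # the constant would be the third integer
        literal = str(ints[-3])
        del ints[-3:]
    else:
        if not state["string"]:
            return  # no string constant available
        # Double any single quotes so the literal stays valid SQL.
        literal = "'" + state["string"].pop().replace("'", "''") + "'"
        del ints[-2:]
    state["where"].append(f"({name} {comparator} {literal})")


# Hypothetical two-attribute table used only for this illustration.
attributes = [("age", "integer"), ("workclass", "string")]
state = {"integer": [37, 2, 0], "string": [], "where": []}
where_condition_from_stack(state, attributes)
print(state["where"])  # ['(age > 37)']
```

In the usage lines, the top integer 0 selects the attribute age, the next integer 2 selects the comparator >, and the third integer 37 becomes the constant.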
The two instructions where_condition_from_index and where
condition from stack, except in the way they choose a constant to constrain the condition. Like where_condition_from_stack, each of these instructions uses the first two integers on the integer stack to determine the attribute and comparator to use for the condition. However, where_condition_from_index and where -
stack, but from a tuple indexed in the database. Both of these instructions use the third integer on the integer stack as an index to a tuple from the relevant table; then, the tuple's value of the selected attribute is used as the condition's constant. The only difference between these two instructions is that where condition from -
the table for that attribute. For example, the entire table may have many tuples where the value of the attribute age is 35; these are all kept by the first instruction, whereas the second only indexes into a list of distinct ages, and therefore has only one index where age is 35, or any other age for that matter. The first instruction can be thought of as giving weight to a particular value equal to the number of times it appears in the database, while the second makes all values equally likely no matter how often they occur.
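The difference in weighting between the two ways of choosing a constant can be illustrated as follows. The function names, the use of the index modulo the list length, and the sorted order of the distinct values are assumptions made for this sketch; the text does not specify them.

```python
def constant_from_index(rows, attribute, index):
    """Pick the constant from the full column, duplicates included."""
    column = [row[attribute] for row in rows]
    return column[index % len(column)]


def constant_from_distinct_index(rows, attribute, index):
    """Pick the constant from the distinct values of the column."""
    distinct = sorted({row[attribute] for row in rows})
    return distinct[index % len(distinct)]


# Made-up table in which age 35 occurs three times and age 52 once.
rows = [{"age": 35}, {"age": 35}, {"age": 35}, {"age": 52}]
indices = range(12)
weighted = [constant_from_index(rows, "age", i) for i in indices]
uniform = [constant_from_distinct_index(rows, "age", i) for i in indices]
print(weighted.count(35), uniform.count(35))  # 9 6
```

Over the twelve indices, the full-column version returns 35 three times out of four, matching its frequency in the table, while the distinct version returns each of the two ages equally often.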
Our QFE runs use two other WHERE-condition creation instructions that act
examples E+ or the negative examples E− instead of the entire table D. These instructions, where_condition_from_pos_ex and where_condition_from_neg_ex, make it so that only values in the example tuples can be indexed to use for the constant. This bias may make it easier to create conditions that specifically relate to the positive or negative examples.
Each of the above instructions creates a single condition and leaves it on the top of the where stack. The instructions where_and, where_or, and where_not allow for arbitrarily connected conditions. where_and and where_or take the top two items on the where stack, join them with AND or OR respectively, and push the result onto the where stack. Similarly, where_not takes the top item on the where stack, puts NOT in front of it, and pushes the result onto the where stack. These instructions allow for arbitrary combinations of conditions to be formed. Though we implemented all three instructions, we found where_not to be more of a hindrance than a help, since WHERE clauses tended to be clogged by nested NOT calls that cancel each other out. We therefore leave where_not out of our instruction set for our experiments.
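A minimal sketch of the three connecting instructions follows. The list-based where stack (last element on top) and the order in which the two operands are joined are assumptions; the text only states that the top two items are joined.

```python
def _join_top_two(state, operator):
    """Replace the top two where-stack items with their combination."""
    stack = state["where"]
    if len(stack) < 2:
        return  # not enough arguments: act as a no-op
    top = stack.pop()
    second = stack.pop()
    stack.append(f"({second} {operator} {top})")


def where_and(state):
    _join_top_two(state, "AND")


def where_or(state):
    _join_top_two(state, "OR")


def where_not(state):
    stack = state["where"]
    if not stack:
        return  # nothing to negate: act as a no-op
    stack.append(f"(NOT {stack.pop()})")


state = {"where": ["(age >= 28)", "(capital_gain > 4787)"]}
where_or(state)
print(state["where"])  # ['((age >= 28) OR (capital_gain > 4787))']

# Two negations in a row cancel out but still lengthen the clause.
where_not(state)
where_not(state)
print(state["where"][0])
# (NOT (NOT ((age >= 28) OR (capital_gain > 4787))))
```

The last two calls show the kind of self-cancelling nesting that motivated leaving where_not out of the instruction set.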
tuples Q_P(E). We use the F1 score as a measure of fitness of the program. The F1 score, developed to evaluate classification accuracy in information retrieval settings, is the harmonic mean of precision and recall (Van Rijsbergen 1979). Precision and recall are defined over true positives, false positives, and false negatives, defined as
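\[
\begin{aligned}
\text{true positives} &= \lVert E^{+} \cap Q_P(E) \rVert \\
\text{true negatives} &= \lVert E^{-} - Q_P(E) \rVert \\
\text{false positives} &= \lVert E^{-} \cap Q_P(E) \rVert \\
\text{false negatives} &= \lVert E^{+} - Q_P(E) \rVert .
\end{aligned}
\]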
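We can then define precision, recall, and F1 score as

\[
\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}, \qquad
\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}},
\]
\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} .
\]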
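These definitions translate directly into set operations. The sketch below assumes that tuples are represented by hashable identifiers and that a query returning no true positives scores 0, a case the formulas leave undefined; both are assumptions of the example, not details given in the text.

```python
def confusion_counts(positives, negatives, returned):
    """Counts for example sets E+ and E- against the returned set Q_P(E)."""
    return {
        "tp": len(positives & returned),
        "tn": len(negatives - returned),
        "fp": len(negatives & returned),
        "fn": len(positives - returned),
    }


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up tuple identifiers.
counts = confusion_counts({1, 2, 3, 4}, {5, 6, 7, 8}, {1, 2, 3, 5})
print(counts)  # {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
print(f1_score(counts["tp"], counts["fp"], counts["fn"]))  # 0.75

# Training-set counts reported for the QFE query in Table 1.4.
print(round(f1_score(5507, 3292, 2334), 4))  # 0.6619
```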
Some programs, when executed, result in problematic stack states or queries that must be handled separately. One such degenerate case is when nothing is left on the where stack of the Push state at the end of program execution. In this case, the program has not produced a WHERE clause and is given a penalty fitness that is worse than the fitness given to any program that produces a non-empty WHERE clause. A second degenerate case occurs when a query takes more time to run than is allowed by QFE. These queries also receive a penalty fitness, worse than that of any query that finishes running, but better than that of a query with an empty WHERE clause.
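The ordering of the two penalties can be expressed with two constants placed below the range of the F1 score. The specific values, the function name, and the convention that a timed-out query is reported as None are hypothetical; the text only fixes the relative order of the three outcomes.

```python
# Hypothetical penalty values; F1 scores lie in [0, 1], so both are worse
# than any score earned by a query that finishes running.
EMPTY_WHERE_PENALTY = -2.0
TIMEOUT_PENALTY = -1.0


def fitness(where_clause, run_query, score):
    """Higher is better.

    where_clause: top of the where stack, or None if the stack was empty.
    run_query:    callable returning the result, or None on a timeout.
    score:        callable mapping a result to its F1 score.
    """
    if where_clause is None:
        return EMPTY_WHERE_PENALTY
    result = run_query(where_clause)
    if result is None:
        return TIMEOUT_PENALTY
    return score(result)


# Stand-in callables used only to exercise the three outcomes.
print(fitness(None, lambda w: [], lambda r: 0.0))         # -2.0
print(fitness("(age > 37)", lambda w: None, lambda r: 0.0))  # -1.0
print(fitness("(age > 37)", lambda w: [], lambda r: 0.0))    # 0.0
```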
We initially predicted that overfitting of queries to the training examples would be a problem, creating very large queries that classify only the training examples well and do poorly on unseen test data. To combat this it would be possible to add a parsimony term to the fitness function, scaling a query's fitness based on how concise it is. This might force evolution to focus on sufficiently simple queries, avoiding overfitting to the training data. In practice, we found that QFE evolved sufficiently simple queries without such a term. This may be due to the bounds placed on maximum Push program sizes or to other dynamics of the system that tend to favor concise queries. In any event, we did not use a parsimony term for the runs described below.
Database Use
Well-designed queries tend to be fast, but poorly designed queries can take a long time to run. Since GP tends to produce and test many strange and bad programs while searching for good ones, fitness testing by running queries can be slow. We extended the implementation of fitness testing in a few ways to speed up the evolutionary process.
Some GP implementations cache the fitnesses of evaluated programs so that if the same program is evaluated more than once, the fitness can be quickly retrieved and not recalculated from scratch. For the problem of evolving programs that create queries, we found that different programs often produce the same query. We altered the caching so that the system caches the fitness of a query instead of the fitness of a program. In this way, different programs that produce the same query can use the same cached fitness value.
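Caching by query rather than by program amounts to keying the cache on the generated WHERE clause. The class below is a sketch under that assumption; the class name, the evaluation counter, and the stand-in evaluator are hypothetical and serve only to show that identical queries are evaluated once.

```python
class QueryFitnessCache:
    """Memoize fitness values by the text of the WHERE clause."""

    def __init__(self, evaluate):
        self._evaluate = evaluate   # expensive function: clause -> fitness
        self._cache = {}
        self.evaluations = 0        # how often the expensive path ran

    def fitness(self, where_clause):
        if where_clause not in self._cache:
            self.evaluations += 1
            self._cache[where_clause] = self._evaluate(where_clause)
        return self._cache[where_clause]


# Stand-in evaluator; a real one would run the query against the database.
cache = QueryFitnessCache(lambda clause: len(clause) / 100)

# Two different programs that happen to develop the same WHERE clause.
clause_from_program_a = "(age > 37)"
clause_from_program_b = "(age > 37)"
cache.fitness(clause_from_program_a)
cache.fitness(clause_from_program_b)
print(cache.evaluations)  # 1
```

Keying on the exact clause text treats textually different but logically equivalent queries as distinct, which keeps the lookup cheap.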
Even with these improvements, some queries run for far too long, significantly slowing down QFE. These anomalous queries, if left to run until finished, would dominate the time QFE takes to evolve a query. We decided to give each query only a certain length of time (0.5 s) to run, after which it is cut off and given a penalty fitness value. We found this limit to allow most queries to finish without letting extremely slow queries slow down a run.
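One way to impose such a limit with SQLite from Python is the progress handler of the standard sqlite3 module, which can abort a running statement. This is a sketch of that technique and not necessarily how QFE enforces its limit; the function name and the polling interval of 1000 instructions are assumptions.

```python
import sqlite3
import time


def run_with_time_limit(conn, sql, limit_seconds=0.5):
    """Return the query's rows, or None if it runs past the time limit."""
    deadline = time.monotonic() + limit_seconds

    def out_of_time():
        # A non-zero return value asks SQLite to abort the running query.
        return 1 if time.monotonic() > deadline else 0

    # Call the handler every 1000 SQLite virtual-machine instructions.
    conn.set_progress_handler(out_of_time, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.DatabaseError:
        # Also catches malformed SQL; a fuller version would tell these apart.
        return None
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler


conn = sqlite3.connect(":memory:")
# A query that never finishes on its own, used to trigger the limit.
endless = ("WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
           "SELECT count(*) FROM c")
print(run_with_time_limit(conn, "SELECT 1"))     # [(1,)]
print(run_with_time_limit(conn, endless, 0.05))  # None
```

A caller would map the None result to the timeout penalty fitness.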
Experimental Design
To determine how well QFE finds queries that correctly classify results, we compare QFE with C5.0 (available at http://rulequest.com/see5-info.html), a modern data mining classification system that produces decision trees that classify data in a way similar to the queries produced with QFE. C5.0 is derived from the widely used C4.5 system (Quinlan 1993). C5.0 creates decision trees that identify patterns in the training examples. Each decision tree's internal nodes represent boolean tests over a tuple's attributes, and leaves give the predicted class of the given tuple. A decision tree can be used to classify examples or to identify patterns in a way that makes sense to a human user. Even though decision trees differ from SQL queries in many aspects, they offer a similar enough alternative to compare with QFE's evolved queries. We will compare these systems on the accuracy, conciseness, and time metrics presented below.
Table 1.2: Attributes for the adult data set

Attribute         Type      Values
workclass         string    9
fnlwgt            integer   21648
education         string    16
education_num     integer   16
marital_status    string    7
occupation        string    15
relationship      string    6
capital_gain      integer   119
capital_loss      integer   92
hours_per_week    integer   94
native_country    string    42
greater_50k       string    2
For our experiments, we used the Adult Data Set from the UC Irvine Machine Learning Repository (Frank and Asuncion 2010). This data set has a single table with 15 attributes and 32,561 tuples and is often used to test classification systems. Each tuple contains census data about a person, including six integer attributes and nine string attributes, where string attributes come from discrete sets of options and most integer attributes have wider ranges. Table 1.2 gives the 15 attributes and how many discrete values occur for each in the database. Note that some tuples are missing values for some string attributes, filled in by the string ‘?’.
We tested QFE on a classification problem presented by the adult data set. This problem, which we call the 50k-Classification problem, requires a system to predict the final attribute of each tuple, which is whether the person represented by that tuple makes more than $50,000 per year. For QFE to evolve queries from examples, we need as inputs a set of positive example tuples E+ and a set of negative example tuples E−. For this problem, the entire training database D is used as the example set E, with each tuple placed in E+ or E− based on the attribute greater_50k. Additionally, the evolved query is not allowed to access the attribute greater_50k, since otherwise the problem would be trivial. PushGP parameters for the 50k-Classification problem are given in Table 1.3. Simplification is a genetic operator unique to PushGP, in which sections from the individual's program are randomly removed in an attempt to shorten the program without lowering its fitness (Klein and Spector 2007). Since each simplification step requires a fitness test, we only used one simplification step per simplification operator, which likely had a very small, if any, effect on the conciseness of evolved queries.

Table 1.3: The PushGP parameters used for the 50k-Classification problem. See (Spector and Klein 2005) for information on trivial geography.
Parameter                     Value
Population Size               1000
Maximum Generations           150
Maximum Program Size          300
Crossover Probability         0.80
Mutation Probability          0.12
Simplification Probability    0.05
Reproduction Probability      0.03
Trivial Geography Radius      10
Node Selection                Unbiased
Fitness Function              F1 Score
We use a variety of metrics to evaluate queries produced by QFE and compare them to the results produced by C5.0. We are primarily interested in measuring the accuracy and conciseness of a query or decision tree and the time required by the system. Our primary metric of accuracy is defined as
\[
\text{accuracy} = \frac{\text{true positives} + \text{true negatives}}{\lVert E \rVert}
\]
We are interested not only in how well our evolved queries perform on the training examples, but also in how well they generalize to other data. We present accuracy results over both the example tuples, which we consider training data, and over a set of test data. The adult data set comes with separate training and testing sets, where only the training set is available to QFE and C5.0 during the creation of classifiers.

For the queries we produce, conciseness is a count of the number of conditions in the query's WHERE clause; for decision trees produced by C5.0, we give the number of leaves in the decision tree. Even though these measures of conciseness are not equivalent, they do at least give an idea of the complexity of the results. Finally, we give a rough estimate of the time required to produce the results on a modest machine, though it should be noted that QFE is a rough proof-of-concept implementation whereas C5.0 has been highly optimized.
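As a quick check of the accuracy metric, the short sketch below computes it from raw counts. The function name is hypothetical; the counts are the training-set figures reported for the QFE query in Table 1.4.

```python
def accuracy(true_positives, true_negatives, total):
    """Fraction of example tuples that are classified correctly."""
    return (true_positives + true_negatives) / total


# 5507 true positives and 21428 true negatives out of 32561 training tuples.
print(round(accuracy(5507, 21428, 32561), 4))  # 0.8272
```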
Results
While we have run QFE repeatedly, we present here the results of just one representative run. On the 50k-Classification problem, a QFE run created the following query:

    SELECT *
    FROM adult
    WHERE (((((education_num >= 10) AND (marital_status = 'Married-civ-spouse'))
             OR (education_num >= 15))
            AND (age >= 28))
           OR (capital_gain > 4787))                                  (1.1)
Of all the queries examined during the run, this query had the best fitness on the training data. This query has found some interesting conditions that are good predictors of whether or not a person makes more than $50,000 per year. First of all, it returns all tuples where (capital_gain > 4787). It also returns tuples where
interesting description of people who make more than $50,000 per year. Additionally, this query is easy to break apart into these sets of conditions, making it easily comprehensible.

Performance results for query (1.1) and the decision tree created by C5.0 are given in Table 1.4. C5.0 gives slightly better accuracy and F1 score results, but QFE is close behind. Interestingly, QFE's accuracy increases between the training and test data, whereas C5.0's decreases. This may indicate that C5.0 is overfitting the training data more than QFE. QFE took about 10 h to produce its query, whereas C5.0 took less than 1 s. Regardless of the optimizations that could be made to QFE, C5.0 is certainly substantially faster. With respect to conciseness, QFE produced a query with five conditions that are easy to understand, as described above. On the other hand, C5.0 produced a decision tree with 124 leaves. Even though this decision tree is more accurate than query (1.1), it does not provide a concise summary of the data in a way that is easily understood by humans.
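To see how the five conditions of query (1.1) interact, the query can be run against a handful of rows in an in-memory SQLite database. The rows and the id column are invented for illustration, and writing the column names with underscores is an assumption about the schema; only the WHERE clause comes from the run described above.

```python
import sqlite3

QUERY = """
SELECT * FROM adult
WHERE (((((education_num >= 10) AND (marital_status = 'Married-civ-spouse'))
         OR (education_num >= 15))
        AND (age >= 28))
       OR (capital_gain > 4787))
"""

# Made-up rows: (id, age, education_num, marital_status, capital_gain).
rows = [
    (1, 40, 13, "Married-civ-spouse", 0),  # educated, married, old enough
    (2, 25, 13, "Married-civ-spouse", 0),  # same, but younger than 28
    (3, 50, 16, "Never-married", 0),       # education_num >= 15 suffices
    (4, 22, 9, "Never-married", 5000),     # capital gain alone suffices
    (5, 45, 9, "Divorced", 0),             # satisfies no branch
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adult (id INTEGER, age INTEGER, education_num INTEGER, "
    "marital_status TEXT, capital_gain INTEGER)"
)
conn.executemany("INSERT INTO adult VALUES (?, ?, ?, ?, ?)", rows)

matched = sorted(row[0] for row in conn.execute(QUERY))
print(matched)  # [1, 3, 4]
```

Row 2 shows that the age condition applies to both education branches, while row 4 shows that the capital gain condition bypasses them entirely.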
We must consider why the queries evolved by QFE are so concise despite there being no incentive for more concise queries in the fitness function. Program sizes in PushGP bloat for most problems, and this is no exception; the mean program size increased during the run that produced query (1.1). Throughout the run, the Push program generating the best query tended to be larger than the average program size; even so, the best found query remained relatively concise. We believe the developmental approach taken by QFE allows programs to bloat while using a small number of developmental instructions, resulting in a small evolved query. There is probably also some evolutionary pressure against overly large queries, which may be more likely to be degenerate.
Table 1.4: Performance measures for the 50k-Classification problem for the solution query (1.1) evolved by QFE and for the C5.0 decision tree. E columns give measures over the training examples while T columns give measures over the test database, which is unseen by the algorithms. For descriptions of metrics, see Section 1

Measure               QFE (E)    QFE (T)    C5.0 (E)   C5.0 (T)
Conciseness           5 conditions          124 leaves
Positives in Table    7841       3846       7841       3846
Negatives in Table    24720      12435      24720      12435
Tuples in Table       32561      16281      32561      16281
True Positives        5507       2703       5313       2490
True Negatives        21428      10814      23334      11612
False Negatives       2334       1143       2528       1356
False Positives       3292       1621       1386       823
Accuracy              0.8272     0.8302     0.8798     0.8662
Precision             0.6259     0.6251     0.7931     0.7516
Recall                0.7023     0.7028     0.6776     0.6474
F1 Score              0.6619     0.6617     0.7308     0.6956
Conclusions and Future Work
We have presented a system called Query From Examples (QFE) that takes, as input, a database and sets of positive and negative examples and produces, as output, an SQL query that characterizes the classification implied by the examples in a concise and human-readable form. We used developmental GP to implement QFE on top of an existing GP system (PushGP) with little modification. Compared to the well-known C5.0 decision tree system, QFE is substantially slower and slightly but not unreasonably less accurate. On the other hand, QFE can produce queries that are far more concise and comprehensible to humans and that are expressed in the widely understood and practically useful form of SQL queries. For many conceivable applications the latter criteria are of paramount importance; this argues for continued exploration of approaches like that taken with QFE.

Certainly the performance of QFE must be improved to support some kinds of applications, but we are confident that evolutionary search times can be reduced from hours to minutes through the use of modern hardware and straightforward software optimizations. This will enable many applications in which human-comprehensible insight about the structure of a data set has substantial value.

Our current implementation of QFE has several limitations, but the developmental GP approach will make it easy to remove many of these. For example, the current implementation does not allow for the evolution of queries that perform joins across multiple tables, create projections by selecting specific attributes in the SELECT clause, use SQL's GROUP BY or HAVING clauses, or use aggregate functions such as COUNT() or AVG(). But each of these capabilities could be provided simply by writing additional developmental instructions; no other changes would have to be made to Push program representations and no changes would have to be made to the evolutionary algorithm or to the fitness assessment procedures. Of course the addition of such capabilities would change the evolutionary space, and it is possible that some of these changes would detract from, rather than enhance, the system's ability to find good queries. But the developmental framework makes it simple to add additional query components and to conduct runs to explore their effects.
QFE should have no problem with a database that has an extremely large number of tuples, as long as the example training set is not also extremely large. We have seen that QFE performs well on the 50k-Classification problem, in which the example set contained over 30,000 tuples. If the example set were orders of magnitude larger, then QFE might take prohibitively long to run, particularly if individual queries take a long time. However, this problem could be ameliorated by using a high-performance distributed system, conducting multiple fitness tests in parallel and also submitting queries to a parallel database server. We performed our runs using a local SQLite database because it was the easiest to set up for our proof-of-principle runs, and a parallel database server would speed things up dramatically. A different limitation may stem from example sets that are too selective. For example, if a database has a limited number of positive examples, there may not be enough examples to accurately evolve a query that satisfies those examples in a general way. Nonetheless, our work indicates that developmental GP has the potential to contribute to the discovery and exploitation of knowledge in databases in significant ways.
Acknowledgements We thank Gerome Miklau for advice regarding databases and the UCI Machine Learning Repository for use of the adult dataset; see http://archive.ics.uci.edu/ml/index.html. This material is based upon work supported by the National Science Foundation under Grant No. 1017817. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References
Acar AC, Motro A (2005) Intensional encapsulations of database subsets by genetic programming. Tech Rep ISE-TR-05-01, Information and Software Engineering Department, The Volgenau School of Information Technology and Engineering, George Mason University, URL http://ise.gmu.edu/techrep/2005/05-01.pdf
Doucette JA, McIntyre AR, Lichodzijewski P, Heywood MI (2012) Symbiotic coevolutionary genetic programming: a benchmarking study under large attribute spaces. Genetic Programming and Evolvable Machines 13(1):71–101, DOI 10.1007/s10710-011-9151-4, special Section on Evolutionary Algorithms for Data Mining
Frank A, Asuncion A (2010) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Freitas A (2002) A survey of evolutionary algorithms for data mining and knowledge discovery. In: Ghosh A, Tsutsui S (eds) Advances in Evolutionary Computation, Springer-Verlag, chap 33, pp 819–845, URL http://www.macs.hw.ac.uk/~dwcorne/Teaching/freitas01survey.pdf
Freitas AA (1997) A genetic programming framework for two data mining tasks: Classification and generalized rule induction. In: Koza JR, Deb K, Dorigo M, Fogel DB, Garzon M, Iba H,