REVERSE ENGINEERING AND AUTOMATIC SYNTHESIS OF METABOLIC PATHWAYS FROM OBSERVED DATA USING GENETIC PROGRAMMING
SYMPOSIUM ON COMPUTATIONAL DISCOVERY OF COMMUNICABLE KNOWLEDGE
SUNDAY, MARCH 25, 2001
CSLI, STANFORD
John R. Koza
Stanford Biomedical Informatics, Department of Medicine
Department of Electrical Engineering
Stanford University, Stanford, California
koza@stanford.edu
Martin A. Keane
Econometrics Inc., Chicago, Illinois
makeane@ix.netcom.com
FROM CHAPTER 1 OF GENETIC PROGRAMMING III: DARWINIAN INVENTION AND PROBLEM SOLVING (KOZA, BENNETT, ANDRE, KEANE 1999)
"Most techniques of artificial intelligence, machine learning, neural networks, adaptive systems, reinforcement learning, or automated logic employ specialized structures in lieu of ordinary computer programs.
"These surrogate structures include if-then production rules, Horn clauses, decision trees, Bayesian networks, propositional logic, formal grammars, binary decision diagrams, frames, conceptual clusters, concept sets, numerical weight vectors (for neural nets), vectors of numerical coefficients for polynomials or other fixed expressions (for adaptive systems), genetic classifier system rules, fixed tables of values (as in reinforcement learning), or linear chromosome strings (as in the conventional genetic algorithm).
FROM CHAPTER 1 OF GENETIC PROGRAMMING III, CONTINUED
"Tellingly, except in unusual situations, the world's several million computer programmers do not use any of these surrogate structures for writing computer programs.
"Instead, for five decades, human programmers have persisted in writing computer programs that intermix a multiplicity of types of computations (e.g., arithmetic and logical) operating on a multiplicity of types of variables (e.g., integer, floating-point, and Boolean). Programmers have persisted in using internal memory to store the results of intermediate calculations in order to avoid repeating the calculation on each occasion when the result is needed. They have persisted in using iterations and recursions. They have similarly persisted for five decades in organizing useful sequences of operations into reusable groups (subroutines) so that they avoid reinventing the wheel on each occasion when they need a particular sequence of operations. Moreover, they have persisted in passing parameters to subroutines so that they can reuse their subroutines with different instantiations of values. And, they have persisted in organizing their subroutines into hierarchies.
FROM CHAPTER 1 OF GENETIC PROGRAMMING III, CONTINUED
"All of the above tools of ordinary computer programming have been in use since the beginning of the era of electronic computers in the 1940s. Significantly, none has fallen into disuse by human programmers. Yet, in spite of the manifest utility of these everyday tools of computer programming, these tools are largely absent from existing techniques of automated machine learning, neural networks, artificial intelligence, adaptive systems, reinforcement learning, and automated logic.
"On one of the relatively rare occasions when one or two of these everyday tools of computer programming is available within the context of one of these automated techniques, they are usually available only in a hobbled and barely recognizable form.
"In contrast, genetic programming draws on the full arsenal of tools that human programmers have found useful for five decades. It conducts its search for a solution to a problem overtly in the space of computer programs.
"Our view is that computer programs are the best representation of computer programs. We believe that the search for a solution to the challenge of getting computers to solve problems without explicitly programming them should be conducted in the space of computer programs."
THE TOPOLOGY OF A NETWORK OF CHEMICAL REACTIONS
the total number of reactions in the network,
the number of substrate(s) consumed by each reaction,
the number of product(s) produced by each reaction,
the pathways supplying the substrate(s) (either from external sources or other reactions in the network) to each reaction,
the pathways dispersing each reaction's product(s) (either to other reactions or external outputs), and
an indication of which enzyme (if any) acts as a catalyst for a particular reaction
THE SIZING FOR A NETWORK OF CHEMICAL REACTIONS
all the numerical values associated with the network (e.g., the rates of each reaction)
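The split between topology and sizing can be made concrete with a small data structure. The following is a minimal sketch, not the representation used in the work itself: the class names `Reaction` and `Network`, their fields, and the example substances (`A`, `X`, `I`) and enzymes (`E1`, `E2`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Reaction:
    """One reaction. All names here are illustrative assumptions."""
    substrates: Tuple[str, ...]   # topology: what the reaction consumes
    products: Tuple[str, ...]     # topology: what the reaction produces
    enzyme: Optional[str]         # topology: catalyst, if any
    rate: float                   # sizing: numerical rate of the reaction


@dataclass
class Network:
    reactions: List[Reaction]

    def topology(self):
        """Everything except the numerical values."""
        return [(r.substrates, r.products, r.enzyme) for r in self.reactions]

    def sizing(self):
        """Only the numerical values, in reaction order."""
        return [r.rate for r in self.reactions]

    def producers(self, substance):
        """Indices of the reactions that produce the given substance."""
        return [i for i, r in enumerate(self.reactions) if substance in r.products]

    def consumers(self, substance):
        """Indices of the reactions that consume the given substance."""
        return [i for i, r in enumerate(self.reactions) if substance in r.substrates]


# Hypothetical two-reaction network with a feedback loop on substance X:
# reaction 0 consumes X and produces I, reaction 1 consumes I and produces X.
net = Network([
    Reaction(("A", "X"), ("I",), "E1", 2.0),
    Reaction(("I",), ("X",), "E2", 0.5),
])
```

Keeping the rates in a separate view makes it easy to change the sizing of a network without touching its topology, or the reverse.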
OUR APPROACH
establishing a representation for chemical networks involving symbolic expressions (S-expressions) and program trees that can be progressively bred (and improved) by means of genetic programming,
converting each individual program tree in the population into an analog electrical circuit representing the network of chemical reactions,
obtaining the behavior of the individual network of chemical reactions by simulating the corresponding electrical circuit,
defining a fitness measure that measures how well the behavior of an individual network matches the observed time-domain data concerning concentrations of final product substance(s), and
using the fitness measure to enable genetic programming to breed an improved population of program trees
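The fitness measure compares simulated and observed time-domain concentrations. The slides do not give its exact form, so the sketch below assumes a plain sum of absolute errors over the sampled time points; the function name `fitness` and its signature are hypothetical.

```python
from typing import Sequence


def fitness(simulated: Sequence[float], observed: Sequence[float]) -> float:
    """Sum of absolute differences between two concentration series.

    Assumed form, not taken from the slides. Lower is better; 0.0 means the
    simulated series matches the observed series at every sampled time point.
    """
    if len(simulated) != len(observed):
        raise ValueError("series must be sampled at the same time points")
    return sum(abs(s - o) for s, o in zip(simulated, observed))


# Example: three time points, with errors of 0.0, 0.5 and 1.0.
example = fitness([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```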
FIVE DIFFERENT REPRESENTATIONS
Reaction Network: The blocks represent chemical reactions and the directed lines represent flows of substances between reactions.
Program Tree: A network of chemical reactions can also be represented as a program tree whose internal points are functions and external points are terminals. This representation enables genetic programming to breed a population of programs in a search for a network of chemical reactions whose time-domain behavior concerning concentrations of final product substance(s) closely matches observed data.
Symbolic Expression: A network of chemical reactions can also be represented as a symbolic expression (S-expression) in the style of the LISP programming language. This representation is used internally by the run of genetic programming.
System of Non-Linear Differential Equations: A network of chemical reactions can also be represented as a system of non-linear differential equations.
Analog Electrical Circuit: A network of chemical reactions can also be represented as an analog electrical circuit. Representation of a network of chemical reactions as a circuit facilitates simulation of the network's time-domain behavior.
ILLUSTRATIVE PROBLEM NO. 1
Acylglycerol lipase (EC3.1.1.23), and
Triacylglycerol lipase (EC3.1.1.3)
2 intermediate substances
sn-Glycerol-3-Phosphate (C00093)
Monoacyl-glycerol (C01885)
ILLUSTRATIVE PROBLEM NO. 1: PHOSPHOLIPID CYCLE
INTERESTING TOPOLOGY
2 instances of a bifurcation point (where one substance is distributed to two different reactions)
External supply of fatty acid (C00162) is
glycerol (C00116) is externally supplied and
glycerol (C00116) is produced by the reaction catalyzed by Glycerol-1-phosphatase (EC3.1.3.21)
1 internal feedback loop (in which a substance is both consumed and produced)
  Glycerol (C00116) is consumed (in part) by the reaction catalyzed by Glycerol kinase (EC2.7.1.30)
  This reaction, in turn, produces an intermediate substance (C00093)
  This intermediate substance is, in turn, consumed by the reaction catalyzed by Glycerol-1-phosphatase (EC3.1.3.21)
  That reaction, in turn, produces glycerol (C00116)
FOUR REACTIONS FROM THE PHOSPHOLIPID CYCLE
ILLUSTRATIVE PROBLEM NO. 2: SYNTHESIS AND DEGRADATION OF KETONE BODIES
3-oxoacid CoA-transferase (EC 2.8.3.5)
Hydroxymethylglutaryl-CoA synthase (EC 4.1.3.5)
Hydroxymethylglutaryl-CoA lyase (EC 4.1.3.4)
1 intermediate substance
INT-1
ILLUSTRATIVE PROBLEM NO. 2: SYNTHESIS AND DEGRADATION OF KETONE BODIES
3 NOTEWORTHY TOPOLOGICAL FEATURES
1 instance of a bifurcation point (where one substance is distributed to two different reactions)
  Acetoacetyl-CoA
2 accumulation points
  Acetyl-CoA is an externally supplied substance and is produced by the reaction catalyzed by Hydroxymethylglutaryl-CoA lyase (EC 4.1.3.4)
  Acetoacetate is produced by the reaction catalyzed by 3-oxoacid CoA-transferase (EC 2.8.3.5) and by the reaction catalyzed by Hydroxymethylglutaryl-CoA lyase (EC 4.1.3.4)
1 internal feedback loop (in which a substance is both consumed and produced)
  Acetyl-CoA is consumed by the reaction catalyzed by Hydroxymethylglutaryl-CoA synthase (EC 4.1.3.5)
  This reaction, in turn, produces an intermediate substance (INT-1)
  This intermediate substance is, in turn, consumed by the reaction catalyzed by Hydroxymethylglutaryl-CoA lyase (EC 4.1.3.4)
  That reaction, in turn, produces Acetyl-CoA
THREE REACTIONS INVOLVED IN THE SYNTHESIS AND DEGRADATION OF KETONE BODIES
GENETIC PROGRAMMING
(1) Generate an initial population of compositions (typically random) of the functions and terminals of the problem.
(2) Iteratively perform the following substeps (referred to herein as a generation) on the population of programs until the termination criterion has been satisfied:
  (A) Execute each program in the population and assign it a fitness value using the fitness measure.
  (B) Create a new population of programs by applying the following operations. The operations are applied to program(s) selected from the population with a probability based on fitness (with reselection allowed).
    (i) Reproduction
    (ii) Crossover (sexual recombination)
    (iii) Mutation
    (iv) Architecture-altering operations
(3) Designate the individual program that is identified by result designation (e.g., the best-so-far individual) as the result of the run of genetic programming. This result may be a solution (or an approximate solution) to the problem.
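The three steps above can be sketched as a short generational loop. This is a toy illustration under several assumptions that are not from the slides: programs are nested tuples over a hypothetical function set (`add`, `sub`, `mul`) and terminal set (`x`, `1.0`, `2.0`); selection is by tournament, which is one common way to select "with a probability based on fitness"; a crude `MAX_NODES` size cap is added; and architecture-altering operations are omitted.

```python
import operator
import random

# Hypothetical toy problem: evolve an arithmetic expression in one variable x.
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
TERMINALS = ["x", 1.0, 2.0]
MAX_NODES = 40  # crude size cap (an assumption, not from the slides)

def random_tree(rng, depth=3):
    """Random composition of functions and terminals, as in step (1)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    return (name, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def paths(tree, prefix=()):
    """Yield the index path to every node of the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from paths(tree[i], prefix + (i,))

def subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)

def error(tree, cases):
    """Fitness: sum of absolute errors over the cases (lower is better)."""
    return sum(abs(evaluate(tree, x) - y) for x, y in cases)

def select(rng, population, cases, k=3):
    """Tournament selection: fitter programs are more likely to be chosen."""
    return min(rng.sample(population, k), key=lambda t: error(t, cases))

def crossover(rng, a, b):
    """Replace a random subtree of a with a random subtree of b."""
    return replace(a, rng.choice(list(paths(a))),
                   subtree(b, rng.choice(list(paths(b)))))

def mutate(rng, a):
    """Replace a random subtree of a with a freshly generated one."""
    return replace(a, rng.choice(list(paths(a))), random_tree(rng, 2))

def run_gp(cases, pop_size=60, generations=20, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]       # step (1)
    best = min(population, key=lambda t: error(t, cases))
    for _ in range(generations):                                   # step (2)
        if error(best, cases) == 0:                                # termination
            break
        offspring = []
        while len(offspring) < pop_size:                           # step (2B)
            parent = select(rng, population, cases)                # uses fitness (2A)
            r = rng.random()
            if r < 0.1:
                child = parent                                     # reproduction
            elif r < 0.9:
                child = crossover(rng, parent, select(rng, population, cases))
            else:
                child = mutate(rng, parent)
            if sum(1 for _ in paths(child)) > MAX_NODES:
                child = parent                                     # reject oversized child
            offspring.append(child)
        population = offspring
        champion = min(population, key=lambda t: error(t, cases))
        if error(champion, cases) < error(best, cases):
            best = champion                                        # best-so-far
    return best                                                    # step (3)

# Hypothetical target behavior: y = x*x + x sampled at five points.
CASES = [(x, x * x + x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
best = run_gp(CASES)
```

The result returned is the best-so-far individual, so it may be an exact or only an approximate fit to the cases, depending on the seed and the run length.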
ARCHITECTURE-ALTERING OPERATIONS
The individual programs that are evolved by genetic programming are typically multi-branch programs consisting of one or more result-producing branches and zero, one, or more automatically defined functions (subroutines).
The architecture of such a multi-branch program involves
  the total number of automatically defined functions,
  the number of arguments (if any) possessed by each automatically defined function, and
  if there is more than one automatically defined function in a program, the nature of the hierarchical references (including recursive references), if any, allowed among the automatically defined functions.
Architecture-altering operations enable genetic programming to automatically determine
  the number of automatically defined functions,
  the number of arguments that each possesses, and
  the nature of the hierarchical references, if any, among such automatically defined functions.
AUTOMATIC SYNTHESIS OF ANALOG ELECTRICAL CIRCUITS
LOWPASS FILTER CIRCUIT
TIME DOMAIN BEHAVIOR OF A LOWPASS FILTER TO A 1,000 HZ SINUSOIDAL INPUT SIGNAL
TIME DOMAIN BEHAVIOR OF A LOWPASS FILTER TO A 2,000 HZ SINUSOIDAL INPUT SIGNAL
FREQUENCY DOMAIN BEHAVIOR OF A LOWPASS FILTER
LOWPASS FILTER CREATED BY GENETIC PROGRAMMING THAT INFRINGES ON GEORGE CAMPBELL'S PATENT
SQUARING COMPUTATIONAL CIRCUIT CREATED BY GENETIC PROGRAMMING
RISING RAMP: 1 OF 4 TIME-DOMAIN SIGNALS USED TO CREATE SQUARING COMPUTATIONAL CIRCUIT
OUTPUT FOR RISING RAMP INPUT FOR SQUARING CIRCUIT
AUTOMATIC SYNTHESIS OF CONTROLLERS
EVOLVED CONTROLLER THAT INFRINGES ON JONES' PATENT
AUTOMATIC SYNTHESIS OF ANTENNAS
ANTENNA DESIGN CREATED BY GENETIC PROGRAMMING
ONE-SUBSTRATE, ONE-PRODUCT CHEMICAL REACTION
One chemical (the substrate) is transformed into another chemical (the product) under control of a catalyst.
CHANGING CONCENTRATIONS OF SUBSTANCES IN AN ILLUSTRATIVE ONE-SUBSTRATE, ONE-PRODUCT REACTION
CHEMICAL REACTIONS
The action of an enzyme (catalyst) in a one-substrate chemical reaction can be viewed as a two-step process in which the enzyme E first binds with the substrate S at a rate $k_1$ to form ES. The formation of the product P from ES then occurs at a rate $k_2$. The reverse reaction (for the binding of E with S), in which ES dissociates into E and S, occurs at a rate of $k_{-1}$.
$$E + S \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} ES \overset{k_2}{\longrightarrow} E + P$$
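The two-step mechanism (E binds S to form ES at rate k1, ES dissociates at rate k-1, ES yields P at rate k2) can be simulated directly. The sketch below assumes mass-action kinetics and a fixed-step forward Euler integrator; the function name `simulate_two_step` and all numerical values are hypothetical.

```python
def simulate_two_step(k1, k_minus1, k2, e0, s0, dt=0.001, steps=20000):
    """Forward-Euler simulation of E + S <-> ES -> E + P (mass action assumed).

    Returns the final concentrations (e, s, es, p).
    """
    e, s, es, p = e0, s0, 0.0, 0.0
    for _ in range(steps):
        bind = k1 * e * s          # E + S -> ES
        unbind = k_minus1 * es     # ES -> E + S
        convert = k2 * es          # ES -> E + P
        e += dt * (-bind + unbind + convert)
        s += dt * (-bind + unbind)
        es += dt * (bind - unbind - convert)
        p += dt * convert
    return e, s, es, p


# Hypothetical parameters: 20 time units with a step of 0.001.
e, s, es, p = simulate_two_step(k1=1.0, k_minus1=0.5, k2=1.0, e0=1.0, s0=1.0)
```

Two conservation checks are useful when experimenting with such a model: free plus bound enzyme stays equal to the initial enzyme, and substrate plus complex plus product stays equal to the initial substrate.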
The concentrations of substrates, products, intermediate substances, and catalysts participating in reactions are modeled by various rate laws, including
  first-order rate laws,
  second-order rate laws,
  power laws, and
  Michaelis-Menten equations.
The Michaelis-Menten rate law for a one-substrate chemical reaction is
$$\frac{d[P]}{dt} = \frac{k_2 [E]_0 [S]_t}{[S]_t + K_m}, \qquad K_m = \frac{k_{-1} + k_2}{k_1}$$
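The one-substrate Michaelis-Menten law, d[P]/dt = k2 [E]0 [S]t / ([S]t + Km) with Km = (k-1 + k2) / k1, translates directly into code. The following is a minimal sketch; the function names and the numerical values are hypothetical.

```python
def michaelis_constant(k1, k_minus1, k2):
    """K_m = (k_-1 + k_2) / k_1."""
    return (k_minus1 + k2) / k1


def michaelis_menten_rate(k2, e0, s, km):
    """Rate of product formation: k_2 [E]_0 [S] / ([S] + K_m)."""
    return k2 * e0 * s / (s + km)


# Hypothetical values: k1 = 2, k-1 = 1, k2 = 3, [E]0 = 1, [S] = 2.
km = michaelis_constant(k1=2.0, k_minus1=1.0, k2=3.0)
rate = michaelis_menten_rate(k2=3.0, e0=1.0, s=2.0, km=km)
```

In this example the substrate concentration equals Km, so the rate is half of the maximum rate k2 [E]0.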
Pseudo-first-order rate law
$$\frac{d[P]}{dt} = \frac{k_2 [E]_0 [S]_t}{K_m} = k_{new} [E]_0 [S]_t, \qquad k_{new} = \frac{k_2}{K_m}$$
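The pseudo-first-order form can be read as the low-substrate limit of the Michaelis-Menten law. The step below is a sketch under that assumption ($[S]_t \ll K_m$), a condition the slide text does not state explicitly.

```latex
\frac{d[P]}{dt}
  = \frac{k_2 [E]_0 [S]_t}{[S]_t + K_m}
  \approx \frac{k_2 [E]_0 [S]_t}{K_m}
  = k_{new} [E]_0 [S]_t
  \quad \text{when } [S]_t \ll K_m,
  \qquad k_{new} = \frac{k_2}{K_m}
```

Under this reading, the rate becomes linear in the substrate concentration, which is why the law is called pseudo-first-order.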
ELECTRICAL CIRCUIT REPRESENTING THE ILLUSTRATIVE ONE-SUBSTRATE, ONE-PRODUCT ENZYMATIC REACTION
SUM-INTEGRATOR
SUBCIRCUIT FOR ONE-SUBSTRATE MICHAELIS-MENTEN EQUATION MICH_1
Subcircuit definition in SPICE for the one-substrate Michaelis-Menten equation MICH_1
*NETLIST FOR MICHAELIS-MENTEN MICH_1
XXM4 4 3 2 XDIVV
XXM3 6 5 3 XADDV
XXM2 7 8 4 XMULTV
XXM1 9 5 8 XMULTV
.SAVE V(2) V(3) V(4) V(5) V(6) V(7) V(8) V(9)
.END
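The four subcircuit calls in the MICH_1 netlist can be traced as a small dataflow graph. The sketch below rests on assumptions the slides do not confirm: each `X...` line is read as two input nodes, then an output node, then the subcircuit name; `XMULTV`, `XADDV`, and `XDIVV` are taken to multiply, add, and divide (first input by second); and node 5 is taken to carry [S], node 6 to carry Km, and nodes 7 and 9 to carry k2 and [E]0. The function name `evaluate_netlist` is hypothetical.

```python
import operator

# The four subcircuit calls from the MICH_1 netlist.
NETLIST = """\
XXM4 4 3 2 XDIVV
XXM3 6 5 3 XADDV
XXM2 7 8 4 XMULTV
XXM1 9 5 8 XMULTV
"""

# Assumed meaning of each subcircuit (not confirmed by the slides).
OPS = {"XDIVV": operator.truediv, "XADDV": operator.add, "XMULTV": operator.mul}


def evaluate_netlist(netlist, inputs):
    """Propagate values through the netlist until every output node is known.

    Assumes each line has the form: name in1 in2 out subcircuit.
    """
    pending = [line.split() for line in netlist.splitlines() if line.strip()]
    values = dict(inputs)
    while pending:
        progressed = False
        for parts in list(pending):
            _, in1, in2, out, kind = parts
            if in1 in values and in2 in values:
                values[out] = OPS[kind](values[in1], values[in2])
                pending.remove(parts)
                progressed = True
        if not progressed:
            raise ValueError("netlist has unresolved inputs")
    return values


# Assumed node roles: 5 = [S], 6 = K_m, 7 = k_2, 9 = [E]_0 (hypothetical values).
values = evaluate_netlist(NETLIST, {"5": 2.0, "6": 2.0, "7": 3.0, "9": 1.0})
```

Under these assumptions node 2 carries k2 [E]0 [S] / (Km + [S]), which is the one-substrate Michaelis-Menten rate.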
ONE-SUBSTRATE, TWO-PRODUCT REACTION
CIRCUIT FOR ILLUSTRATIVE ONE-SUBSTRATE, TWO-PRODUCT CHEMICAL REACTION
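TWO-SUBSTRATE, ONE-PRODUCT REACTION
$$A + B + E \longrightarrow ABE \longrightarrow P + E$$
The Michaelis-Menten rate law for a two-substrate chemical reaction is
$$\mathrm{Rate}_t = \frac{[E]_0}{\dfrac{1}{K_0} + \dfrac{1}{K_A [A]_t} + \dfrac{1}{K_B [B]_t} + \dfrac{1}{K_{AB} [A]_t [B]_t}}$$
$$\mathrm{Rate}_t = k_1 [A][B][E]$$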
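The two-substrate rate law, Rate = [E]0 / (1/K0 + 1/(KA [A]) + 1/(KB [B]) + 1/(KAB [A][B])), and the simpler product form Rate = k1 [A][B][E] can be compared numerically. The sketch below is illustrative only; the function names and all numerical values are hypothetical, and the limiting case shown (K0, KA, KB very large) is an observation about the formula, not a claim made in the slides.

```python
def two_substrate_rate(e0, a, b, k0, ka, kb, kab):
    """Two-substrate Michaelis-Menten rate as given by the four-term denominator."""
    denominator = 1.0 / k0 + 1.0 / (ka * a) + 1.0 / (kb * b) + 1.0 / (kab * a * b)
    return e0 / denominator


def product_form_rate(k1, a, b, e):
    """Simple product form: k1 [A][B][E]."""
    return k1 * a * b * e


# With every constant and concentration equal to 1, the denominator is 4.
unit_rate = two_substrate_rate(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)

# When K0, KA and KB are very large, only the 1/(KAB [A][B]) term remains,
# and the rate approaches KAB [A][B][E]0, which has the product form.
limit_rate = two_substrate_rate(1.0, 3.0, 0.5, 1e12, 1e12, 1e12, 2.0)
simple_rate = product_form_rate(2.0, 3.0, 0.5, 1.0)
```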
CIRCUIT FOR TWO-SUBSTRATE, ONE-PRODUCT CHEMICAL REACTION
TWO-SUBSTRATE MICHAELIS-MENTEN EQUATION MICH_2
$$\mathrm{Rate}_t = \frac{[E]_0}{\dfrac{1}{K_0} + \dfrac{1}{K_A [A]_t} + \dfrac{1}{K_B [B]_t} + \dfrac{1}{K_{AB} [A]_t [B]_t}}$$
REPERTOIRE OF FUNCTIONS IN THE PROGRAM TREE
returns the first of the one or two products produced by the chemical reaction function designated by its argument
returns the second of the two products (or, the first product, if the reaction produces only one product)
REPERTOIRE OF TERMINALS IN THE PROGRAM TREE
Substances
externally supplied input substances
intermediate substances created by reactions
output substances
Enzymes
Numerical constants for the rate of the reactions
PROGRAM TREE CORRESPONDING TO METABOLIC PATHWAY FOR PHOSPHOLIPID CYCLE
REPRESENTATION OF PHOSPHOLIPID CYCLE AS A SYMBOLIC EXPRESSION
$$\frac{d[C00165]}{dt} = 1.45\,[C00162][C01885][EC3.1.1.3]$$
$$\frac{d[C01885]}{dt} = 1.95\,[C00162][C00116][EC3.1.1.23] - 1.45\,[C00162][C01885][EC3.1.1.3]$$
Supply and consumption of the intermediate substance sn-Glycerol-3-Phosphate (C00093) in the internal feedback loop
$$\frac{d[C00093]}{dt} = 1.69\,[C00116][C00002][EC2.7.1.30] - 1.19\,[C00093][EC3.1.3.21]$$
$$\frac{d[ATP]}{dt} = 1.5 - 1.69\,[C00116][C00002][EC2.7.1.30]$$
$$\frac{d[C00162]}{dt} = 1.2 - 1.95\,[C00162][C00116][EC3.1.1.23] - 1.45\,[C00162][C01885][EC3.1.1.3]$$
$$\frac{d[C00116]}{dt} = 0.5 + 1.19\,[C00093][EC3.1.3.21] - 1.69\,[C00116][C00002][EC2.7.1.30] - 1.95\,[C00162][C00116][EC3.1.1.23]$$
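The six rate equations for the phospholipid cycle can be integrated numerically to obtain time-domain behavior. The sketch below is one reading of the slide text, with several assumptions labeled in the code: the rate constants 1.45, 1.95, 1.69, 1.19 and the supply rates 1.5, 1.2, 0.5 are taken from the equations; the signs and the enzyme factor on every reaction term are this sketch's reconstruction; ATP is treated as the same species as C00002; and the enzyme concentrations, initial concentrations, step size, and function names are hypothetical.

```python
SPECIES = ["C00165", "C01885", "C00093", "C00002", "C00162", "C00116"]


def derivatives(c, enz):
    """Right-hand sides of the six rate equations (reconstructed reading).

    c maps substance IDs to concentrations; enz maps EC numbers to enzyme
    concentrations. C00002 is ATP (assumed identical to [ATP] in the slides).
    """
    kinase = 1.69 * c["C00116"] * c["C00002"] * enz["EC2.7.1.30"]
    phosphatase = 1.19 * c["C00093"] * enz["EC3.1.3.21"]
    acyl = 1.95 * c["C00162"] * c["C00116"] * enz["EC3.1.1.23"]
    triacyl = 1.45 * c["C00162"] * c["C01885"] * enz["EC3.1.1.3"]
    return {
        "C00165": triacyl,
        "C01885": acyl - triacyl,
        "C00093": kinase - phosphatase,
        "C00002": 1.5 - kinase,                    # external supply of 1.5
        "C00162": 1.2 - acyl - triacyl,            # external supply of 1.2
        "C00116": 0.5 + phosphatase - kinase - acyl,  # external supply of 0.5
    }


def euler(c, enz, dt, steps):
    """Fixed-step forward Euler integration (a simple stand-in integrator)."""
    c = dict(c)
    for _ in range(steps):
        d = derivatives(c, enz)
        for name in SPECIES:
            c[name] += dt * d[name]
    return c


# Hypothetical setup: unit enzyme concentrations, everything starts at zero.
ENZYMES = {"EC2.7.1.30": 1.0, "EC3.1.3.21": 1.0, "EC3.1.1.23": 1.0, "EC3.1.1.3": 1.0}
start = {name: 0.0 for name in SPECIES}
final = euler(start, ENZYMES, dt=0.001, steps=1000)

# Derivatives when every concentration equals 1 (handy for checking signs).
at_ones = derivatives({name: 1.0 for name in SPECIES}, ENZYMES)
```

Forward Euler is used only to keep the example short; the slides obtain the time-domain behavior by simulating the corresponding electrical circuit instead.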