A COMBINED STATISTICAL AND IN SILICO FRAMEWORK
FOR ANALYSIS AND CHARACTERIZATION OF
MICROBIAL AND MAMMALIAN METABOLIC NETWORKS
SELVARASU SURESH
(B.Tech., University of Madras, Chennai, India)
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF CHEMICAL & BIOMOLECULAR ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2009
Acknowledgements
It is with great pleasure that I take this opportunity to express my gratitude to all those who have helped me in my research and, more so, in shaping my PhD into an enriching experience. The research guidance that I received from my advisors, Prof. I. A. Karimi and Dr. Lee Dong-Yup at NUS, was much more than I had expected. With due respect, I express my sincere gratitude to them for being wonderful and inspiring supervisors. Without their immense support, timely inputs, guidance and encouragement, my progress would have been impossible. There are no words to explain their influence on my research. I also wish to thank them for involving me in several projects, especially in collaborations with research institutes (BTI), which gave me a very good chance to learn more. It was indeed a privilege to work with them.
I would like to thank Dr. Victor Wong and Dr. Dave Ow from BTI, Singapore, for their immense help in providing me with the experimental data. I appreciate their patience in explaining the nuances of experimental strategies whenever I approached them. I extend my thanks to A/P Loh Kai-Chee and A/P Sanjay Swarup for kindly agreeing to be on the panel of examiners and for their valuable suggestions for planning this research during the qualifying exam. I also thank the final reviewers for spending time on evaluating this thesis. I also express my gratitude to Dr. Lakshminarayanan for his valuable suggestions at different times during my PhD.
I wish to thank all the anonymous reviewers of our publications, who gave constructive feedback on all our manuscripts and helped us to bring the best out of this research. I also take this opportunity to thank all those dedicated researchers who shared their research in the form of literature, website notes, and freely available online data. This information has played a major part in strengthening this research work.
I also express my gratitude to all the professors at ChBE/NUS whose valuable lectures and seminars have led to good ideas for this research. Special thanks to Prof. I. A. Karimi and Prof. Rangaiah for giving me an opportunity to teach undergraduate students (it was an enriching experience for me at NUS), which earned me the best tutor award. It is indeed an honor. I also thank the ChBE department for financially supporting all my conference visits.
Special thanks to all my labmates and other friends at NUS (if I started naming them, the list would keep rolling) for their affectionate support and interactions, which made my journey at NUS a wonderful experience. I would also like to thank all the GSA office bearers for helping me in one way or another and for bringing the best out of me at different times during GSA activities. Lastly, I thank all my professors, students and affectionate friends who trained and inspired me to be what I am today. I will cherish this wonderful journey for long.
And most importantly, I thank my parents (Mr. Selvarasu and Mrs. Vijaya), my sister (Mrs. Veni), and my niece and nephew (Dharani and Nirmal) for always being my source of inspiration. Their love, continued support and motivation were the main driving force for me during my PhD. I am ever grateful and indebted for their care and affection.
Table of Contents
List of Tables
List of Figures
Nomenclature
Abbreviations
SUMMARY
1 Introduction
1.1 Cellular organisms and their complex functioning
1.2 Systems biology: a new paradigm in biological research
1.2.1 Knowledge required for systems biology
1.2.2 Approaches in systems biology
1.2.3 Opportunities to unravel biological functions
1.3 Analysis techniques available in the data rich environment
1.4 Motivation for research
1.5 Scope of the present work
1.6 Organization of the thesis
2 Modeling and analysis of biological systems: An overview
2.1 Tools available for modeling biological systems
2.2 Genome-scale modeling
2.3 Constraints-based modeling approach
2.4 Other metabolic network simulations
2.5 Algorithms available for characterizing metabolic networks
2.6 Systems biotechnology: An approach for systematic strain improvement
2.7 In silico techniques available for strain improvement
2.8 Tools for multivariate data analyses
2.9 Research directions
3 Framework for combined analysis using statistical and in silico approaches
3.1 Introduction
3.2 Experimental data and their trend
3.3 Data preprocessing and elemental balancing
3.3.1 Cumulative consumption and specific rates calculation
3.4 Multivariate statistical data analysis (PCA and PLS)
3.4.1 Principal component analysis (PCA)
3.4.2 Partial least squares regression (PLS)
3.5 In silico modeling and analysis
3.5.1 Metabolic network reconstruction
3.5.2 Constraints-based flux analysis
3.6 Application of the framework
4 Application of framework for characterizing Escherichia coli DH5α growth and metabolism in a complex medium
4.1 Introduction
4.2 Materials and methods
4.2.1 Strains and culture conditions
4.2.2 Analytical techniques
4.2.3 Data preprocessing for statistical analysis
4.2.4 Constraints-based flux analysis
4.3 Results and discussion
4.3.1 Growth, metabolite uptake and excretion profiles during batch culture
4.3.2 Elemental balancing
4.3.3 Multivariate statistical analysis
4.3.4 In silico metabolic flux analysis
4.3.5 Sensitivity analysis of amino acid and glucose consumption
4.3.6 Analysis of the metabolite consumption and utilization
4.3.7 Availability of other nutrients in the medium
4.3.8 Exploring the statistical analysis results using in silico analysis
4.4 Concluding remarks
5 Genome-scale modeling and in silico analysis of mouse cell metabolism
5.1 Introduction
5.2 Materials and methods
5.2.1 Metabolic network reconstruction
5.2.2 Network visualization
5.2.3 Statistical network analysis
5.2.4 Constraints-based flux analysis
5.3 Results and discussion
5.3.1 Genome-scale reconstruction of mouse metabolic network
5.3.2 Comparison of mouse model with yeast and E. coli genome-scale models
5.3.3 In silico model validation
5.3.4 Structural and functional characterization of mouse metabolism
5.3.5 Important role of lipid pathway in mouse metabolism
5.3.6 Alternate flux distributions and flux variations
5.4 Conclusion
6 Application of framework to elucidate mouse hybridoma cell growth and metabolism in a fed-batch culture
6.1 Introduction
6.2 Materials and methods
6.2.1 Cell line and culture medium
6.2.2 Analytical techniques
6.2.3 Data preprocessing for statistical analysis
6.2.4 Constraints-based flux analysis
6.3 Results and discussion
6.3.1 Fed batch cell culture
6.3.2 Elemental Balancing on Fed-batch Data
6.3.3 Multivariate Statistical Analysis
6.3.4 In silico metabolic flux analysis
6.3.5 Other possible cellular objectives
6.3.6 Understanding cellular behavior from combined analysis
6.4 Conclusion
7 Identification of necessary genes and evaluating their perturbations for strain improvement in E. coli
7.1 Introduction
7.2 Algorithm for identifying sufficient and necessary genes
7.2.1 Mathematical formulations and algorithm
7.2.2 Identifying set of necessary genes
7.3 Application of the algorithm
7.3.1 Analysis in E. coli DH5α metabolic network
7.4 Application of the necessary gene sets to identify knockout combinations for succinate production
7.5 Concluding remarks
8 Contributions and future recommendations
8.1 Summary of the contributions
8.2 Future directions
8.2.1 Expanding the horizon of mouse cell metabolism
8.2.2 Reconstruction of metabolic network of CHO cell lines
References
Appendices
List of Publications
VITAE
List of Tables
2.1 List of available genome-scale models for various organisms
3.1 List of public resources available for reconstruction of genome-scale metabolic models
4.1 Comparison of metabolic reaction fluxes of amino acids biosynthetic reactions
4.2 Sensitivity of amino acids, glucose and trehalose uptake on cell biomass production in phase 1a
4.3 Sensitivity of amino acids, glucose and trehalose uptake on cell biomass production in phase 2a
4.4 Consumption or production of amino acids for biosynthetic demand as well as for other metabolites production in phase 1a
4.5 Consumption or production of amino acids for biosynthetic demand as well as for other metabolites production in phase 2a
4.6 Comparison of ATP consuming metabolic pathways for complex and minimal medium conditions
5.1 Online resources for reconstructing genome-scale mouse metabolic network
5.2 Characteristics of the mouse genome-scale metabolic network and its comparison with the previous generic model
5.3 Comparison of mouse genome-scale network characteristics with yeast and E. coli
6.1 Summary of specific consumption or production rate of measured metabolites during the exponential growth phase of the cell culture
6.2 Production and utilization of pyruvate in central metabolism during the exponential growth phase of the cell culture
6.3 Energy production from central carbon metabolism in all states
7.1 List of necessary reactions for both cell growth and succinate production
7.2 List of double knockout gene combinations that enhance succinate production in E. coli DH5α
List of Figures
1.1 Interaction of the different areas of expertise in performing systems biology research
1.2 Flowchart showing the major focus of the current research work and the organization of the addressed research issues in different chapters of the thesis
2.1 Genome-scale reconstruction of metabolic network and elucidation of the systemic properties using constraints-based analysis approach
3.1 Schematic illustration of the workflow involved in the analysis using combined statistical and in silico framework
4.1 Profiles of optical cell density and residual concentration of various nutrient components and products in the complex medium. Highlighted regions correspond to three different growth phases of the culture. Phase 1: initial exponential growth phase; phase 2: late exponential growth phase; phase 3: acetate consumption phase. A: Optical density values (OD600), concentration of glucose, trehalose and acetate. B: concentration of amino acids which were rapidly consumed; L-aspartate (ASP), glycine (GLY), proline (PRO), methionine (MET), serine (SER), L-asparagine (ASN), L-tyrosine (TYR), L-threonine (THR), L-glutamate (GLU) and L-alanine (ALA). C: concentration of amino acids which were not completely consumed; L-valine (VAL), L-lysine (LYS), L-isoleucine (ILE), L-leucine (LEU), L-phenylalanine (PHE), L-histidine (HIS) and L-arginine (ARG).
4.2 Results obtained from multivariate statistical analysis
4.3 Results of PLS analysis. Black arrows indicate positive correlation between those amino acids and cell growth. Dotted arrows indicate positive correlation between those amino acids and acetate production. The negative effect of the set of amino acids on acetate is shown using bold lines and on cell growth is shown with dashed lines. A: correlation based on PLS and B: strategies for feed medium design for enhancing cell viability.
4.4 Specific consumption rates of all the measured nutrients and specific growth rate during initial exponential phase (phase 1) and the late growth phase (phase 2). The value for histidine in phase 1 corresponds to its specific production rate. The rates are ranked according to their specific consumption rates in phase 1.
4.5 Schematic diagrams of metabolic flux distributions and flux-sum across the metabolites serine, pyruvate and acetate. A: Metabolic flux distribution across the central metabolic pathways and amino acids biosynthetic pathways during the exponential growth phase (phase 1: underlined) and late growth phase (phase 2: normal) of the microbial culture. Reactions with higher flux values are highlighted with red (phase 1) and green (phase 2). Serine, pyruvate and acetate are highlighted with squares. B: consumption and production of the metabolites serine, pyruvate and acetate are shown using the flux-sum values across each of the metabolites for phase 1 and phase 2. Percentage contributions to each of the metabolites are also shown. PEP, Phosphoenolpyruvate; GLC, glucose; PYR, pyruvate; GLY, glycine; TRE, trehalose; MAL, L-malate; TRP, L-tryptophan; ALAC-S, (S)-2-acetolactate; ACCOA, acetyl coenzyme A; 23DHDP, 2,3-dihydrodipicolinate; 2AHBUT, (S)-2-Aceto-2-hydroxybutanoate; ACSER, O-acetyl-L-serine; PS_EC, phosphatidylserine; CIT, citrate. Annotation of other metabolites follows that of the iJR904 model (Reed et al., 2003).
4.6 Interpretation of statistical and in silico analysis results. A: set of positively correlated amino acids with cell growth and acetate production and the intracellular conversion of amino acids into various metabolites. B: the plausible effect of reducing amino acids (gly, ile, val and his) in the complex medium at the intracellular level. Arrow with bold outline: positive correlation with cell growth and arrow with dashed line: positive correlation with acetate production.
5.1 Schematic representation of the iterative approach employed in the reconstruction and analysis of genome-scale mouse model. The existing model was used as template and the network was expanded by compiling the information (genome, biochemical and mouse physiological data). Missing links and redundant reactions were then identified to refine the model with such available resources. The resultant expanded model underwent the validation process using constraints-based flux analysis with cell culture and in vivo gene essentiality data for verifying the prediction. The presence of knowledge gaps was explored and again the model can be improved interactively. Subsequently, the model was analyzed both structurally and functionally to characterize mouse metabolism and identify key pathways, reactions and metabolites.
5.2 Functional classifications of metabolic reactions in mouse genome-scale model, (A) current updated model and (B) old model. Numbers on pie charts indicate reactions in each subsystem. Metabolic subsystems with number of gene and non-gene associated reactions are detailed in the table.
5.3 Comparison of metabolites across mouse, yeast and E. coli genome-scale models. Metabolites from cytosol were only considered for comparison.
5.4 Comparison of in silico growth rate with experimentally observed growth rate during batch culture. Specific growth rate is in h⁻¹; mAb production rate in mg gDCW⁻¹ h⁻¹. The bars with black and white colors represent specific consumption and production rates, respectively.
5.5 Comparison of in silico substrate requirements with experimentally observed substrate requirements for cell growth. Essential nutrients in the media are highlighted in red colour and non-essential nutrients are highlighted in blue colour.
5.6 The connectivity of metabolites in different reactions in the metabolic network. The reactions involved in significantly improved metabolic subsystems such as carbohydrates, lipids and amino acids metabolisms are indicated by their edge colours: green, blue and red, respectively. Metabolites colors: blue - cytosol, red - mitochondria, green - extracellular and yellow - cofactors. Metabolites and reactions from amino acids, lipids and carbohydrates metabolism were extracted to draw individual edge generated graphs. Essential reactions and metabolites in the sub networks are highlighted using cross and star-shaped nodes. Network diameter and average path lengths (APL) for the main network and the three sub-networks are also shown.
5.7 Correlation between metabolite degree and betweenness centrality for (A) all metabolites, (B) essential metabolites and (C) non-essential metabolites. The metabolite can be identified as essential when its removal leads to no growth. Highly-connected, bridging metabolites are highlighted in (A). ACP: acyl carrier protein, ACCOA: acetyl-coA, ACCOAm: acetyl-coA mitochondrial, AKG: α-ketoglutarate, AKGm: α-ketoglutarate mitochondrial, AMASA: L-2-Aminoadipate-6-semialdehyde, ANA: N-acetylneuraminate, CAR: carnitine, GLAC: D-galactose, GLC: D-glucose, GLU: L-glutamate, GLY: glycine, MALACP: malonyl-[acyl-carrier-protein], PPIXm: Protoporphyrin mitochondrial, PYR: pyruvate, SAH: homocysteine, SAM: S-adenosyl-L-methionine, SUCC: succinate, SER: L-serine and URI: uridine.
5.8 Visualization of the ACCOA interaction across lipid metabolism, TCA cycle and glycolysis. The enlarged section shows the high connectivity and bridging characteristics of ACCOA. Blue edges: lipid metabolic reactions, green: TCA cycle and red: glycolysis. ACCOA: acetyl-coA.
5.9 Comparison of (A) metabolite flux-sum and (B) metabolic flux distribution during cell growth under normal and AKG deletion conditions. Metabolites flux sum and flux distributions in carbohydrates and nucleotides metabolisms are shown in the enlarged sections. Blue and red color bars represent normal and AKG deletion conditions, respectively. AKG: α-ketoglutarate.
5.10 Classification of essential (A) reactions and (B) metabolites according to different metabolic subsystems in the mouse metabolism
5.11 Reaction usages in multiple optimal flux distribution. The graph shows the fraction of the metabolic flux distributions that utilize a specific reaction categorized under different subsystems.
6.1 Profiles of viable cell density, mAb, amino acids, glucose, OUR, lactate and ammonia in the fed batch culture. A: Viable cell density and mAb concentration. B: Glucose, glutamine, OUR, lactate and ammonia concentrations. C: Concentration profiles of all essential amino acids. D: Non-essential amino acids concentrations. mAb: monoclonal antibodies (IgG1); ARG: arginine; THR: threonine; SER: serine; GLY: glycine; TYR: tyrosine; PHE: phenylalanine; MET: methionine; HIS: histidine; ASN: asparagine; ASP: aspartate; LYS: lysine; VAL: valine; ILE: isoleucine; GLU: glutamate; LEU: leucine; ALA: alanine; GLN: glutamine; GLC: glucose; LAC: lactate; NH3: ammonia; OUR: oxygen uptake rate. The concentrations of the amino acids tryptophan, cysteine and proline were negligible.
6.2 Summary of the results from multivariate statistical analysis using PCA and PLS for fed-batch mouse hybridoma cell culture. Amino acids consumption/production rates were clustered using PCA. Correlation between the variables obtained from PLS analysis is also shown. PCA: Principal Component Analysis; PLS: Partial Least Squares.
6.3 Schematic illustration of the correlation identified by PLS analysis. Dotted lines indicate the negative interaction of the amino acids (asp, glu and ala) with cell growth and mAb production rate.
6.4 Experimental and simulated growth rates for different time points during the exponential growth phase of the culture
6.5 Metabolic flux distributions across the carbohydrate metabolism in hybridoma cells. Flux across the three pathways including glycolysis, pentose phosphate pathway and TCA cycle are shown for all the 12 time points during the exponential growth phase.
6.6 Overall distributions of simulated internal fluxes across different metabolic pathways on the left for time point V in Figure 3. The expanded region on the right details the simulated flux values within the central carbon metabolism. Bar length and the direction indicate the minimum and maximum possible flux values achieved by flux variability analysis.
6.7 The resulting flux distributions from MFA illustrate consumption of all essential and non-essential amino acids from the media and subsequent utilization of all essential amino acids for the production of non-essential amino acids within the cell
6.8 Metabolic activities of the consumed nutrients inside the cell. Metabolites in purple colour, EAA; green, NEAA; black, ala, glu, lac and NH3; red, cell growth and mAb. EAA, essential amino acids; NEAA, non-essential amino acids.
7.1 The algorithm represents an iterative method to identify the set of sufficient genes and their corresponding reactions. For executing the algorithm the cellular objectives (growth, biochemical productions) are fixed at different levels of their maximum values and minimum sets of genes are determined.
7.2 Illustration of sufficient genes identification approach. Circles 1, 2, and 3 represent different levels of cellular objectives. The shaded region in dark circle shows the essential set of genes for achieving cellular objective values for all the three cases and the remaining regions correspond to necessary genes.
7.3 Number of sufficient genes required for maintaining cell growth rates
7.4 Succinate production limits for wild type and the mutants. The bold line indicates the limits for wild type strain and the points indicate double knockout mutants. Red color circle: result of SUCD1i/SUCD4 and NADH6 knockout. Blue color circle: result of SUCD1i and PGL. Other combinations are described in Table 7.2.
Nomenclature
V Culture volume (ml)
X_v Viable cell concentration (10⁶ cells ml⁻¹)
µ Specific growth rate (h⁻¹)
q_s Specific substrate consumption rate (mmol h⁻¹ cell⁻¹)
q_p Specific production rate (mmol h⁻¹ cell⁻¹)
S Substrate concentration (mM)
P Product concentration (mM)
S_f Substrate feed concentration (mM)
P_f Product feed concentration (mM)
F Feed flow rate (ml h⁻¹)
t Time (h)
v_j Reaction flux (mmol gDCW⁻¹ h⁻¹)
α_j Lower bound for reaction flux (mmol gDCW⁻¹ h⁻¹)
β_j Upper bound for reaction flux (mmol gDCW⁻¹ h⁻¹)
S_ij Stoichiometric coefficient of metabolite i in reaction j (dimensionless)
Z Objective function in the optimization problem
c_j Weight associated with the reaction fluxes in objective function (dimensionless)
M Number of metabolites in the network (dimensionless)
N Number of reactions in the network (dimensionless)
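As a reading aid, the flux-related symbols above fit together in the standard steady-state linear program used in constraints-based flux analysis. The form below is a generic sketch and is only assumed to match the formulation developed later in the thesis; in particular, the choice of maximization and the steady-state assumption are the usual conventions, not statements taken from this list.

```latex
\begin{align}
\max_{v}\quad & Z = \sum_{j=1}^{N} c_j \, v_j \\
\text{subject to}\quad & \sum_{j=1}^{N} S_{ij} \, v_j = 0, \qquad i = 1, \dots, M \\
& \alpha_j \le v_j \le \beta_j, \qquad j = 1, \dots, N
\end{align}
```

Here the equality constraints express a mass balance on each of the M metabolites, and the bounds restrict each of the N reaction fluxes to its allowed range.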
Abbreviations
FBA Flux Balance Analysis
GAMS The General Algebraic Modeling System
IgG1 Immunoglobulin G1
LP Linear Programming
mAb Monoclonal Antibody
MDS Multidimensional Scaling
MFA Metabolic Flux Analysis
MILP Mixed Integer Linear Programming
MINLP Mixed Integer Nonlinear Programming
MOMA Minimization of Metabolic Adjustments
OMNI Optimal Metabolic Network Identification
PCA Principal Component Analysis
PCR Principal Component Regression
PLS Partial Least Squares regression
QP Quadratic Programming
ROOM Regulatory On/Off Minimization
SUMMARY
With advances in experimental technologies, high-throughput experimental data are being generated to describe the micro- and macro-molecular cell functions of complex biological systems. Understanding these functions is essential for improvements in biomedical research and, more importantly, for biotechnological processes. Microbial and mammalian cells are commonly used in these processes to produce very high-value therapeutics. In recent years, the demand for these compounds has been increasing, which points to the need for improved cell culture performance. However, there are complexities associated with cell culture, mainly due to deviations in the culture conditions and heterogeneous interactions among different variables in the culture and between different cellular components, which make it difficult to elucidate the cellular functions. In addition, accumulation of toxic metabolites in the culture also leads to reduced productivity or cell death. These complexities pose a major challenge in developing high-yielding cell cultures. Motivated by these challenges, the main objectives of this research include reviewing potential unresolved issues pertaining to understanding the complex functionalities associated with microbial and mammalian metabolism, and resolving them using suitable techniques, which would enable us to improve the performance of fermentation processes for producing high-value therapeutics.
Multivariate statistical techniques have often been used to extract biologically relevant information from high-throughput experimental data, even though they do not provide any insight into the organism's internal metabolic activities. To address this limitation, genome-scale modeling approaches can be used to improve our understanding of the internal cellular metabolism of organisms. Thus, these two approaches can be used concomitantly to better understand and characterize complex microbial and mammalian cellular systems.
The emphasis of this research is to develop a framework that systematically combines statistical and in silico approaches to identify important nutrient components in a culture medium based on experimental data and to study the effect of these components on the internal metabolic behavior of cellular systems. This understanding would be crucial for modifying or designing organisms to enhance byproduct yield and for developing efficient biotechnological processes.
The major research issues addressed in this work and their corresponding outcomes are:
Combined framework: The first part of the thesis involves the development of the combined framework, using multivariate statistical analysis techniques and in silico modeling approaches, for characterizing cell culture fermentation and exploring the internal cell metabolism. The most relevant statistical methods for examining the experimental data are described. Subsequently, the various steps and procedures involved in reconstructing a genome-scale metabolic model and conducting in silico analysis are detailed.
Application to microbial system: The second part of the thesis applies the framework to microbial metabolic networks. E. coli was chosen as the model organism due to its applicability to biotechnological processes. The framework was applied to examine the growth and metabolism of the E. coli DH5α strain grown in a complex medium. Highly correlated nutrients in the culture medium were identified using statistical analysis, and the effect of nutrient consumption on intracellular metabolism was explored using constraints-based genome-scale modeling.
Application to mammalian system: The third part of the thesis considers the analysis and characterization of a mammalian metabolic system. In this case, mouse cell lines were used due to their wide application in both the biomedical and biotechnological communities. Initially, we reconstructed the mouse cell metabolic network using the genome-scale modeling approach and investigated its structural properties. Subsequently, statistical analysis was performed for a fed-batch culture of mouse hybridoma cells grown in a complex medium, producing IgG1 (a monoclonal antibody). In silico analysis was then performed using the reconstructed model to elucidate the internal metabolic states of mouse cells based on the observations from statistical analysis.
Strain improvement strategies: The last part of the thesis deals with the development of a novel optimization algorithm to identify the set of necessary genes/reactions in the metabolic network for cell growth and byproduct production. This algorithm can be used to select gene knockout candidates for mutant phenotypes that can enhance the yield of desired byproducts (e.g., amino acids, succinate, etc.). The applicability of this approach can be easily tested and verified experimentally for developing high-yielding microbial and mammalian cell lines.
1 Introduction
1.1 Cellular organisms and their complex functioning
Cellular functions are often complex due to the high degree of interaction among various molecules and organelles within a cell and across cells. Based on the level of these internal complexities, living cells/systems have been mainly classified into prokaryotes and eukaryotes. Prokaryotes possess the simplest cell structure. Nevertheless, their functioning is highly complex due to their molecular interactions at different time and spatial scales. Eukaryotes exhibit much more complex functions due to the presence of different organelles within the cell, which makes their functions even harder to understand. Until recently, most biological research was devoted to understanding the properties of isolated molecules, thereby reducing the complexity involved in biological systems. In reality, however, the function of a molecule depends on its interactions: the presence of surrounding molecules in a cellular environment may result in its activation, suppression, or regulation. Thus, functions that arise from the interaction of different molecules cannot be easily understood or predicted by studying isolated molecules. Biological systems often differ significantly from physical systems due to their complex microscopic and macroscopic behavior, which results from the interactions of several thousand to a few million different components (Hartwell et al. 1999). This calls for a higher-level approach that handles these complexities and integrates functionalities and interactions at different levels in order to elucidate cellular functions at both microscopic and macroscopic levels. Such inferences are not easily achieved by the conventional reductionist approach.
The availability of genome sequences for different microbial and mammalian organisms, together with technological advancements in genomics and high-throughput experimental techniques, has generated a wealth of biological data that give information on genes, mRNA, proteins, and metabolic products and their functions. So far, biologists have not effectively utilized this vast amount of data due to the challenges involved in integrating them. This complexity is coupled with the difficulty involved in integrating the functioning of different cellular organelles. Such integration underlies the emergence of “Systems Biology”, an interdisciplinary research field that aims to develop a quantitative understanding of cellular functions. It involves characterizing different components of a biological system using the knowledge and techniques of systems engineering (Kitano 2002). The post-genomic era of cellular biology can focus on utilizing this approach to understand the mechanisms through which biological functions emerge from the interaction of numerous molecular components.
1.2 Systems biology: a new paradigm in biological research
Systems biology is a new scientific discipline that studies the behavior of complex biological organizations through the integration of diverse quantitative information and mathematical modeling to generate predictive hypotheses and elucidate the functions of a biological system (Aderem 2005; Hartwell et al. 1999; Hood et al. 2004; Westerhoff and Palsson 2004). Although engineers have applied the concept of integrating the systems behavior of biological systems for years, the term systems biology emerged as a distinct research paradigm only in recent years. The significance of this interdisciplinary research area is evident from the number of publications retrieved for the topic "systems biology" in an ISI Web of Science search. The number of articles on this topic was merely 9 in 2001. This has grown severalfold in recent years, to the extent that the number of such articles exceeded 1000 in 2008 (ISI Web of Science).
Systems biology research has been propelled by the successes of molecular biology and genetics, which have made available genomic blueprints of numerous organisms, together with extensive experimental data covering most aspects of cell functions. They also present an opportunity for a significant role of theory, which can guide experiments by developing increasingly complex hypotheses formed on the basis of modeling the phenomena and analyzing genomic and other experimental data. Beyond the cell level, systems biology addresses questions of how multi-cellular organisms develop and function, and how populations interact on the ecological scale.
1.2.1 Knowledge required for systems biology
Progress in systems biology requires a deep and detailed understanding of biological
systems, which is essential for identifying the “right” questions. It requires the development
of novel concepts geared towards living systems, which are extremely heterogeneous,
non-generic and nontrivially coupled to the environment. Since it is intrinsically
interdisciplinary, it draws on expertise and perspectives from different disciplines
such as engineering, biology, computer science, physics, chemistry and mathematics.
Ideas and concepts from these diverse fields will enrich physical science as it strives
to describe the complexity of living matter. Biology provides a complementary
perspective from which to consider, analyze, and ultimately understand the living world,
whereas physics and chemistry come in handy in probing the behavior of molecules and
their activity inside living cells. Engineering applications can effectively harness the
power of living systems and solve problems that cannot be solved in any other way.
Mathematics is important in developing accurate first-principles models of a biological
system (starting with a small subsystem of a single cell) and then predicting its dynamics
over time.
Figure 1.1 Interaction of the different areas of expertise involved in systems biology research
Advanced technical expertise in bioinformatics, computation, statistical analysis,
and mathematical modeling is pivotal for integrating and making sense of the large and
complex datasets generated through high-throughput experimental techniques. Through
integration and modeling, these studies allow us to better exploit the complexity of
genomic data and extract its biological and clinical significance. The integration and
modeling of such diverse information can vastly enhance the power of the systems biology
approach, help us to decipher the mechanisms behind metabolic behavior,
and provide new insights for exploration. The approach can be further
extended with the analysis, modeling, simulation, and design techniques that are
used in electronic, control and systems engineering. Furthermore, a combined
effort by science, engineering and mathematics can be useful in exploring the complex
functional interactions of the system (Fig 1.1).
1.2.2 Approaches in systems biology
Model-driven analyses and their experimental validation are the two major components
of systems biology research. Analysis of biological systems using this approach can be
broadly divided into two categories. The first is quantitative systems biology, which deals
with the extraction of quantified information such as molecular responses in a biological
system to a given perturbation. Some of the technology platforms used for this approach
are:
• Gene expression measurement through DNA microarrays and SAGE
• Protein levels through two-dimensional gel electrophoresis and mass spectrometry,
including phosphoproteomics and other methods to detect chemically modified
proteins
• Metabolomics for small-molecule metabolites
• Glycomics for sugars
These techniques are frequently combined with large-scale perturbation methods,
including gene-based (RNAi, misexpression of wild type and mutant genes) and chemical
approaches using small molecule libraries. Robots and automated sensors enable such
large-scale experimentation and data acquisition. These technologies are still emerging,
and many face the problem that the larger the quantity of data produced, the lower the
quality. A wide variety of quantitative scientists (computational biologists, statisticians,
mathematicians, computer scientists, engineers, and physicists) are working to improve
the quality of these approaches and to create, refine, and retest the models until the
predicted behavior accurately reflects the observed phenotype.
The second category in systems biology derives qualitative
predictions, using knowledge from molecular biology to develop causal models
that mimic the biological system of interest and to propose hypotheses that explain its
systemic properties. These hypotheses can then be confirmed and used as a basis for
developing mathematical models of the system. The causal models are used to explain
the effects of biological perturbations qualitatively, while mathematical models are used
to predict quantitatively how different perturbations in the system's environment affect
the system.
1.2.3 Opportunities to unravel biological functions
Two important questions arise in using the systems biology approach:
• Is systems biology suitable for exploring most complex biological problems?
• What opportunities and challenges does this field of research provide, and
what will be its intellectual outcome in the future?
The main goal of systems biology is to utilize the knowledge available in systems
engineering to gain a clear understanding of basic biological functions at the
microscopic or macroscopic level. It can be foreseen that, a few decades from now,
systems biology research will have generated a vast amount of new information about life
processes, from the role of specific genes to the metabolism of whole organisms.
This technology has the potential to bring about changes in medicine, agriculture,
industry, bioremediation, and energy. When such technology is utilized, the mysteries of
biological evolution may be unlocked and the knowledge gained can be applied for the
benefit of humankind.
1.3 Analysis techniques available in the data rich environment
Recent advances in experimental techniques, automation, and sophisticated measurement
technology have resulted in high-precision, high-speed, and high-throughput data. This
has generated extensive interest, and investigations are being carried out with the aim of
improving the quality of the data obtained from different biotechnological and
biomedical processes. A huge number of datasets are available in public databases, and
it is possible to perform extensive database searches and data mining to extract information of
biological interest. The increasing number of genome projects has also accelerated the
availability of datasets that provide gene, protein and physiological data
for a multitude of organisms. Most of these projects are completed or currently in progress.
The heavy reliance of the biotechnological, biopharmaceutical, and biomedical industries
on these vast datasets provides an opportunity to apply data processing
techniques to gain knowledge from them. However, the complexity of
the data obtained from these experiments poses a serious challenge to the research
community. This has resulted in systems-level studies for querying and understanding
biological datasets. Statistical analysis techniques at various levels have been
extensively employed to process experimental data and gain valuable
information. The work presented here also performs statistical data analysis,
mainly for fermentation processes involving microbial and mammalian cell lines
producing products ranging from important metabolites to recombinant proteins.
1.4 Motivation for research
A detailed literature review of the significance of analyzing complex biological systems
and their functioning is provided in Chapter 2 under several subtopics. The need for
systems-level analysis of biological systems, growing confidence in the credibility of
computational analysis techniques, and the use of statistical data processing techniques for
bioprocesses are some of the important themes that stand out in the recent scientific
literature. This review identified important problems yet to be solved
in the following areas:
Systems biology - overview of the current research activities with more emphasis on
computational analysis of complex networks
Statistical analysis techniques - various techniques available for performing data mining
and preprocessing of experimental data obtained from different cell culture experiments
Microbial and mammalian metabolism - available genomic and biochemical
information for microbial and mammalian metabolic systems, their biotechnological
applications, and limitations
Genome-scale models - an overview of available genome-scale models and methods for
their reconstruction, which would enable us to develop similar models for microbial and
mammalian systems
Analysis of genome-scale models - various analysis techniques available for
genome-scale models, their merits and limitations, and potential strain improvement techniques
The limitations identified in existing methodologies and techniques, the
knowledge gaps in metabolic systems, and the need for a
combined systems approach to understand the behavior of biological systems are the key
issues that motivated this research work. Some of the potential challenges that are
addressed in this work are highlighted below:
• Challenges involved in fully understanding the behavior of biological
systems, in particular microbial and mammalian systems
• Challenges in addressing the complexity of data obtained from experiments such as
batch and fed-batch fermentation cultures
• Challenges in integrating and applying the data analysis techniques that are available
for these fermentation processes
• Challenges involved in the reconstruction of genome-scale models in terms of
available biological information
• Challenges in combining modeling and data analysis techniques
• Challenges associated with designing new biological systems for strain
improvement
1.5 Scope of the present work
Advances in the genomic revolution and the increasing availability of biological
experimental data motivated the core objectives of the current research
work. It involves developing a framework for modeling and analyzing microbial and
mammalian systems using a combined statistical and in silico approach to gain insights
into the effect of the external cellular environment on internal cell metabolic behavior.
This would enable us to infer the systemic properties of the networks and propose
testable hypotheses for cellular reengineering through strain improvement. The following are
some of the specific issues addressed in this study:
• Review of various systems level analysis techniques available for analyzing complex
biological systems
• Review of various data analysis techniques available for processing biological data
and utilizing effective methods for preprocessing experimental data
• Identifying biological systems in the context of biomedical research and
biopharmaceutical applications
• Reconstructing metabolic reaction network of the identified organisms with available
genome information
• Identifying suitable in silico analysis techniques for understanding cellular genotype
and its relation to phenotype
• Identifying cellular capabilities and validating the predictions with the available
experimental information
• Development of novel techniques for designing mutant phenotypes in the context of
strain improvement
Figure 1.2 highlights the important issues covered in this research. It summarizes the
depth of the research in terms of modeling approaches and their combinations for analyzing
biological systems, and its breadth in terms of applications to well-known microbial and
mammalian metabolic systems. The work has also focused on addressing the major issues
in developing strategies for biotechnological advancement in terms of strain
improvement for byproduct production.
1.6 Organization of the thesis
Chapter 2 gives an extensive review of current research initiatives in systems biology,
with emphasis on the computational approaches available for modeling and analysis
of microbial/mammalian metabolism as well as the fermentation processes which use
them. Various techniques used to analyze the metabolic models are summarized.
The merits and demerits of the techniques are also described in detail. The major
challenges and bottlenecks in the current approaches are highlighted, and the
objectives of this research work are derived from them.
Chapter 3 gives an overview of the modeling and analysis framework for the
experimental data and the metabolic networks. The first step of the framework involves
collection of experimental data from fermentation cultures, followed by data
preprocessing and statistical analysis to gain information on extracellular/environmental
effects on the cell culture process. The effect of the external environment on internal cell
metabolism is then explored by constraints-based in silico analysis with the aid of
reconstructed genome-scale models of the organisms.
Chapter 4 implements and tests the developed framework on one of the well-studied
microbes, E. coli, to gain insight into the cell metabolism from both statistical and in
silico aspects.
Chapter 5 describes the reconstruction of the genome-scale model of mouse (an
important mammalian system) based on a previous generic version and on
updated genomic, biochemical and cell physiological data for Mus musculus.
Chapter 6 applies the framework to mouse metabolism to elucidate the metabolic
behavior under varying environmental conditions. The
reconstructed genome-scale mouse model is used for the in silico analysis.
Chapter 7 describes the development of an efficient optimization
algorithm for identifying the set of genes that can be used as knockout candidates
for strain improvement. This technique has been applied to the E. coli metabolic
network for improving succinate production.
Chapter 8 highlights the key findings and contributions of the current research.
Potential extensions and future recommendations are identified and suggested.
The overall thesis organization is shown in Figure 1.2.
Figure 1.2 Flowchart showing the major focus of the current research work and the organization
of the addressed research issues in different chapters of the thesis
2 Modeling and analysis of biological systems:
An overview
The normal and abnormal behaviors of a living cell or cellular system are governed by
complex networks of interacting biomolecules. Modeling these networks allows us to
make predictions about cellular behavior under a variety of environmental cues (Price
and Shmulevich 2007), which is valuable to both the scientific and commercial communities.
The main advantage of modeling approaches is the reduction in the time and cost
required for experimental investigations, achieved by pinpointing the effect of
important parameters or conditions on the system. The value of modeling cellular
behavior using mathematical representations and in silico simulations of complex cellular
functions has long been recognized (Price et al 2003). So far, many theoretical
approaches have been attempted (Rigoutsos and Stephanopoulos 2007; Wolkenhauer
2002) to model biological systems at scales ranging from a few atoms and small
systems to large systems such as the whole-cell functions of
organisms (Guardia 2002).
Most modeling approaches attempted to build cellular systems based on metabolic
pathways, as these were well characterized both qualitatively and quantitatively (Rigoutsos
and Stephanopoulos 2007; Stephanopoulos et al 1998; Varner and Ramkrishna 1999).
These works mainly focused on modeling in vitro cell cultures (Sidoli
et al 2004) and their behavior under different conditions. A macroscopic approach to
biochemical networks was adopted to simplify the problem by lumping the cell regions
and species concentrations (Bower and Bolouri 2001). This major assumption helps in
formulating the cell as a reactor using kinetic and transport equations (Stephanopoulos
and Stafford 2002). Attempts have also been made to introduce the effect of randomness
through intrinsic (gene expression, mutations, intracellular product accumulation) and
extrinsic (cell growth, degradation, environmental changes) stochasticity into in silico
models (Kepler and Elston 2001; Meng et al 2004). Such analysis helps to explain the
stability, reliability and robustness of biosystems and to explore the existence of systems
with multiple stable states (Kauffman 1969; Kitano 2004).
2.1 Tools available for modeling biological systems
As the emphasis of this research is mainly on modeling and analysis approaches, this
section reviews some of the important tools and techniques that are used for modeling
biological systems and, in particular, metabolic networks.
Many modeling approaches are currently being used to model cellular processes.
Owing to the presence of many parameters, variables and constraints, a variety of numerical
and computational techniques are used in biosystems modeling and analysis (Haefner
1996). In the last two decades, various computational tools such as Cellware,
CellDesigner, MetaFluxNet, Gepasi, and KINsolver have been proposed for modeling
biological systems. Some of the commonly used modeling techniques in these tools are
described next.
Kinetic modeling: This technique includes modeling of reaction kinetics for
understanding metabolic pathways (Steuer et al 2006) and for simulating gene interaction
circuits (de Jong 2002). Dynamic simulation of biological systems using sets of ordinary
differential equations (ODEs) (Mendes 1993), models representing the cell division and
growth cycle in bacteria [90], and quantitative models of the metabolism of a whole (hypothetical) cell
(Tomita et al 1999) also use kinetic modeling approaches.
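As a minimal illustration of the ODE-based kinetic modeling described above, the sketch below integrates a single Michaelis-Menten reaction S -> P with a fixed-step fourth-order Runge-Kutta scheme. The reaction, the parameter values (Vmax, Km, initial concentrations) and the function names are assumptions chosen for illustration; they are not taken from any of the cited models.

```python
def michaelis_menten_rate(s, vmax, km):
    """Reaction rate v = Vmax * S / (Km + S)."""
    return vmax * s / (km + s)


def simulate(s0, p0, vmax, km, t_end, dt):
    """Integrate dS/dt = -v, dP/dt = +v with a fixed-step RK4 scheme."""
    def deriv(state):
        s, _ = state
        v = michaelis_menten_rate(s, vmax, km)
        return (-v, v)

    state = (s0, p0)
    trajectory = [(0.0, s0, p0)]
    n_steps = int(round(t_end / dt))
    for i in range(1, n_steps + 1):
        k1 = deriv(state)
        k2 = deriv(tuple(x + 0.5 * dt * k for x, k in zip(state, k1)))
        k3 = deriv(tuple(x + 0.5 * dt * k for x, k in zip(state, k2)))
        k4 = deriv(tuple(x + dt * k for x, k in zip(state, k3)))
        state = tuple(
            x + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        trajectory.append((i * dt, state[0], state[1]))
    return trajectory


# Illustrative (hypothetical) parameter values.
trajectory = simulate(s0=10.0, p0=0.0, vmax=1.0, km=0.5, t_end=20.0, dt=0.01)
t_final, s_final, p_final = trajectory[-1]
print(f"t={t_final:.1f}  S={s_final:.6f}  P={p_final:.6f}")
```

Even this one-reaction model needs two kinetic parameters, which hints at why parameter requirements grow quickly when the same approach is scaled to whole-cell metabolism.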
Stochastic modeling: It captures the dynamic interactions of different processes in a
complex biological system (Wilkinson 2006). It provides a quantitative understanding of
cell physiology at multiple scales. Such modeling techniques have been successfully
applied to metabolic systems such as E. coli to study the lactose regulation system
(Julius et al 2008).
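One common way to realize stochastic modeling is Gillespie's stochastic simulation algorithm. The sketch below applies it to a hypothetical birth-death process (constant synthesis and first-order degradation of one molecular species). It is not the lactose regulation model of the cited work; the rate constants and function names are assumptions for illustration only.

```python
import math
import random


def gillespie_birth_death(k_syn, k_deg, n0, t_end, seed=0):
    """Simulate 0 -> X (rate k_syn) and X -> 0 (rate k_deg * n) up to t_end."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_syn = k_syn            # propensity of synthesis
        a_deg = k_deg * n        # propensity of degradation
        a_total = a_syn + a_deg
        # The waiting time to the next event is exponentially distributed.
        t += -math.log(1.0 - rng.random()) / a_total
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_syn:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


def time_average(times, counts, t_end):
    """Time-weighted mean copy number over [0, t_end]."""
    total = 0.0
    for i, n in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += n * (t_next - times[i])
    return total / t_end


T_END = 2000.0
times, counts = gillespie_birth_death(k_syn=10.0, k_deg=0.5, n0=0, t_end=T_END)
mean_copy_number = time_average(times, counts, T_END)
print(f"time-averaged copy number: {mean_copy_number:.2f}")
```

For this process the deterministic steady state is k_syn / k_deg = 20 molecules, so the time average of a long run should fluctuate around that value while individual trajectories show the intrinsic noise that a purely deterministic model would miss.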
Cybernetic modeling: These models also incorporate kinetic information and predict the
dynamic interactions of the biological system (Kompala et al 1984). The effects of
perturbations in enzyme levels on the rates of substrate production or product formation
are explored. Cybernetic models also describe the coupling between metabolic fluxes and
environmental conditions. These models are used to describe complex dynamic phenomena such as
steady-state multiplicity, oscillatory behavior, unbalanced growth, and futile cycling in a
metabolic network (Kompala 1999; Varner and Ramkrishna 1999).
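The thesis does not state the cybernetic control laws explicitly. As a hedged sketch, the code below assumes the matching-law and proportional-law forms commonly associated with the cybernetic framework, in which the enzyme synthesis control is u_i = r_i / sum_j r_j and the enzyme activity control is v_i = r_i / max_j r_j for competing pathways with rates r_i. The Monod-type rate expression and all parameter values are hypothetical.

```python
def monod_rate(mu_max, enzyme, substrate, k_s):
    """Growth rate supported by one substrate at a given relative enzyme level."""
    return mu_max * enzyme * substrate / (k_s + substrate)


def cybernetic_controls(rates):
    """Return (u, v): synthesis and activity control variables for competing pathways.

    Assumed forms: u_i = r_i / sum(r) and v_i = r_i / max(r).
    """
    total = sum(rates)
    peak = max(rates)
    if total <= 0:
        raise ValueError("at least one pathway must have a positive rate")
    u = [r / total for r in rates]
    v = [r / peak for r in rates]
    return u, v


# Two hypothetical substrates competing for the cell's resources.
rates = [
    monod_rate(mu_max=1.0, enzyme=1.0, substrate=5.0, k_s=0.1),  # preferred substrate
    monod_rate(mu_max=0.6, enzyme=1.0, substrate=5.0, k_s=0.5),  # secondary substrate
]
u, v = cybernetic_controls(rates)
print("synthesis control u:", [round(x, 3) for x in u])
print("activity control v:", [round(x, 3) for x in v])
```

In a full cybernetic model these control variables would multiply the enzyme synthesis and reaction rate terms of the kinetic ODEs, so that resources shift toward whichever pathway currently gives the largest return.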
Although these methods are useful, they need many parameters
to model complex biological functions such as complete cellular metabolism, which also
increases computational complexity. Thus, genome-scale modeling techniques are widely
applied for analyzing biological functions, especially intracellular metabolism, as they do
not require any kinetic parameters. The current research work mainly focuses on
genome-scale modeling. Thus, an extensive and in-depth review of this modeling technique is
given below.
2.2 Genome-scale modeling
The sequencing of the first bacterial genome signified a transition in biology from a data-poor
to a data-rich environment. Since then, various sets of ‘omics’ data such as genomics,
proteomics, transcriptomics, metabolomics, and fluxomics have become available,
shifting modeling approaches to a new horizon. This also led to a new challenge:
developing models with organism-specific information that are capable of integrating various
omics data and accounting for the inherent biological functions (Joyce and Palsson 2006;
Kell 2006; Yaspo 2001). Fully sequenced genomes and their annotations for
different organisms are available in many public databases (Eppig et al 2007; Kanehisa
et al 2002). This abundance of genomic information and omics data has accelerated the
emergence of ‘genome-scale modeling’, which represents a reconstructed metabolic
network at the genome level (Papin et al 2003). These genome-scale models are
reconstructed by incorporating diverse data including genome annotation, biochemical
reaction information, and cell physiology experiments. The term “genome-scale”
describes the inclusion of all metabolic reactions that can be determined to take place in an
organism based on genome annotation and the biochemical literature. Many such
genome-scale models are available for different organisms, and their reconstruction was
based on genome annotations that were 60-70% complete. Table 2.1 lists several such
genome-scale models with the number of genes, reactions and
metabolites in each. With such models, analytical frameworks are developed which can be
transformed into mathematical models for describing cellular functions. These
mathematical representations are important for analyzing the metabolism and identifying the
gene functions of the organisms. The analysis of such models leads to a better
understanding of the system's structure and behavior (modeling and simulation), and
interpretations based on the analysis and predictions can explain the cellular
functions of the organism. Hence, it is possible to identify the genotype-phenotype
relationship.
2.3 Constraints-based modeling approach
The challenges involved in genome-scale modeling are mainly addressed by a
constraints-based approach, which gives rise to flux balance analysis (FBA) (Bonarius et
al 1997; Varma and Palsson 1994a; Varma and Palsson 1994b). FBA is a method for studying
the capabilities of metabolic networks at steady state; it uses linear optimization
techniques to provide quantitative predictions and testable hypotheses (e.g., optimal
growth rate). Analysis of a genome-scale model using the FBA approach involves several
steps (Fig 2.1):
• First step: Reconstructing the metabolic network of the organism
• Second step: Developing constraints for the model that reflect the working
state of the organism. These include reaction stoichiometry, enzyme capacity,
reaction reversibility and thermodynamic constraints on biochemical loops
• Third step: The constraints define a solution space in which every solution to the
network equations satisfies physiologically meaningful operating conditions. This
solution space contains all possible functions of the reconstructed network, i.e. all
allowable phenotypes. Within this space, linear programming
has traditionally been used to predict optimal states such as maximal growth and ATP
production
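The three steps above can be sketched on a toy problem. The example below is a minimal illustration, assuming a hypothetical four-reaction network (it is not one of the genome-scale models discussed in this thesis) and using `scipy.optimize.linprog` as the linear programming solver. The steady-state condition is written as S v = 0, where S is the stoichiometric matrix and v the flux vector; the reaction names and bounds are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (not from the thesis):
#   R1: -> A          (substrate uptake, capped at 10 flux units)
#   R2: A -> B
#   R3: A -> C
#   R4: B + C ->      (biomass drain, used as the objective)
reactions = ["R1_uptake", "R2_A_to_B", "R3_A_to_C", "R4_biomass"]

# Step 1: the reconstructed network as a stoichiometric matrix S.
# Rows are metabolites A, B, C; columns are the reactions above.
S = np.array([
    [1, -1, -1,  0],   # A
    [0,  1,  0, -1],   # B
    [0,  0,  1, -1],   # C
], dtype=float)

# Step 2: constraints. Steady state (S v = 0) plus capacity and
# irreversibility bounds on every flux.
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]

# Step 3: linear programme. linprog minimizes, so the objective is negated.
objective = np.zeros(len(reactions))
objective[reactions.index("R4_biomass")] = 1.0
result = linprog(c=-objective, A_eq=S, b_eq=np.zeros(S.shape[0]),
                 bounds=bounds, method="highs")
if not result.success:
    raise RuntimeError(result.message)

fluxes = dict(zip(reactions, result.x))
max_growth = -result.fun
print("maximum biomass flux:", round(max_growth, 6))
for name, value in fluxes.items():
    print(f"  {name}: {value:.3f}")
```

Because biomass formation in this toy network consumes B and C in equal amounts, the uptake flux is split evenly between R2 and R3. No kinetic parameter is needed, only stoichiometry and bounds.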
Genome-scale metabolic models using the constraints-based approach have been
successfully developed and tested for several organisms such as Escherichia coli,
Haemophilus influenzae, Helicobacter pylori, Saccharomyces cerevisiae, and Staphylococcus
aureus.
2.4 Other metabolic network simulations
Constraints-based analysis forms the basic platform for carrying out several in silico
analyses to explore the systemic behavior of metabolic networks. These techniques and
their merits and demerits are briefly summarized below.
Gene deletion studies: In this approach, the reactions associated with individual genes are
deleted from in silico models, and the consequences of the deletions are assessed
(Edwards and Palsson 2000b; Forster et al 2003; Varma and Palsson 1993). Gene
deletions in general modify the allowable states of the metabolic network and result in a
reduction of the wild-type solution space. These studies have been useful in predicting
the behavior of cell phenotypes of Helicobacter pylori and E. coli with an accuracy
of 60-90% (Reed and Palsson 2003; Schilling et al 2002). However, gene deletion
studies can also lead to false predictions. For instance, if a gene deletion is predicted to be
lethal but has been experimentally identified as non-lethal, it suggests that there exists an
alternate pathway to fulfill the cell requirements. Alternatively, if a gene is identified as
non-lethal in silico and is experimentally found to be lethal, it suggests that another factor
besides biomass synthesis is making the gene deletion lethal (Duarte et al 2004).
Such failure modes are very important, as they indicate that the
metabolic network is incomplete, and they can help in updating and improving the in silico models for
better prediction and thereby understanding of the organisms’ physiology.
Table 2.1 List of available genome-scale models for various organisms
Genes in model   Metabolites   Reactions   Reference         Year
Bacteria
400              451           461         Schilling et al   2000
291              340           388         Schilling et al   2002
551              604           712         Heinemann et al   2005
Archaea
Eukaryotes
672              636           1,038       Kuepfer et al     2005
800              1,013         1,446       Nookaew et al     2008
Figure 2.1 Genome-scale reconstruction of metabolic network and elucidation of the systemic properties using constraints-based analysis
approach
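The single-gene deletion study described in this section can be sketched as follows: the flux bounds of every reaction associated with a gene are set to zero and the maximum growth is recomputed by FBA. The network, the gene-to-reaction mapping and the lethality threshold below are hypothetical and chosen only to show one reduced-growth deletion, one silent deletion and one lethal deletion; `scipy.optimize.linprog` is assumed as the solver.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical network: uptake -> A; A -> B by a main and a lower-capacity
# alternative route; B -> biomass.
reactions = ["uptake", "ab_main", "ab_alt", "biomass"]
S = np.array([
    [1, -1, -1,  0],   # A
    [0,  1,  1, -1],   # B
], dtype=float)
wild_type_bounds = {
    "uptake": (0, 10), "ab_main": (0, 1000), "ab_alt": (0, 4), "biomass": (0, 1000),
}
# Assumed gene-reaction associations (one gene per reaction here).
gene_to_reactions = {"g1": ["ab_main"], "g2": ["ab_alt"], "g3": ["biomass"]}


def max_growth(bounds_by_reaction):
    """Maximize the biomass flux subject to S v = 0 and the given bounds."""
    c = np.zeros(len(reactions))
    c[reactions.index("biomass")] = -1.0   # linprog minimizes
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=[bounds_by_reaction[r] for r in reactions],
                  method="highs")
    return -res.fun if res.success else 0.0


def knock_out(gene):
    """Growth after forcing all reactions of the deleted gene to zero flux."""
    bounds = dict(wild_type_bounds)
    for reaction in gene_to_reactions[gene]:
        bounds[reaction] = (0, 0)
    return max_growth(bounds)


wild_type = max_growth(wild_type_bounds)
deletion_growth = {gene: knock_out(gene) for gene in gene_to_reactions}
lethal = [g for g, mu in deletion_growth.items() if mu < 1e-6 * wild_type]

print("wild type growth:", round(wild_type, 6))
for gene, mu in deletion_growth.items():
    print(f"  delete {gene}: growth {mu:.3f}")
print("predicted lethal deletions:", lethal)
```

Comparing such predictions with experimental lethality data is what exposes the two failure modes discussed above: a missing alternate pathway in the model, or a requirement other than biomass synthesis.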
Double knockouts: Double deletion analysis is similar to single gene deletion. In this
case, genes are deleted in pairs and the resulting consequences are assessed and compared
with experimental data (Thiele et al 2005). This identifies essential pairs of
genes (the presence of at least one is necessary, and the removal of both leads to a lethal
phenotype). Such analyses can help predict the underlying regulatory mechanisms between
genes.
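A pairwise deletion screen can be sketched in the same way by enumerating gene pairs and flagging those that are lethal only in combination. The network below is a hypothetical example with two redundant steps, not a model from the cited study; the gene names, bounds and lethality threshold are assumptions.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog

# Hypothetical network with two redundant steps: A -> B and B -> C can each
# be carried out by either of two isoenzymes.
reactions = ["uptake", "ab_1", "ab_2", "bc_1", "bc_2", "biomass"]
S = np.array([
    [1, -1, -1,  0,  0,  0],   # A
    [0,  1,  1, -1, -1,  0],   # B
    [0,  0,  0,  1,  1, -1],   # C
], dtype=float)
upper = {"uptake": 10.0}   # all other reactions default to 1000
gene_to_reaction = {"g1": "ab_1", "g2": "ab_2", "g3": "bc_1", "g4": "bc_2"}
LETHAL_THRESHOLD = 1e-6


def growth_without(genes):
    """Maximum biomass flux after deleting the given genes."""
    removed = {gene_to_reaction[g] for g in genes}
    bounds = [(0, 0) if r in removed else (0, upper.get(r, 1000.0))
              for r in reactions]
    c = np.zeros(len(reactions))
    c[reactions.index("biomass")] = -1.0   # linprog minimizes
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    return -res.fun if res.success else 0.0


genes = sorted(gene_to_reaction)
single_lethal = {g for g in genes if growth_without([g]) < LETHAL_THRESHOLD}
# Pairs whose members are individually dispensable but jointly essential.
synthetic_lethal = [
    pair for pair in combinations(genes, 2)
    if not single_lethal.intersection(pair)
    and growth_without(pair) < LETHAL_THRESHOLD
]
print("single lethal genes:", sorted(single_lethal))
print("synthetic lethal pairs:", synthetic_lethal)
```

The number of pairs grows quadratically with the number of genes, so for genome-scale models the screen is usually restricted to genes that are not already essential on their own, as done here.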
Gene additions: Although gene deletion is the most commonly used approach in
elucidating cellular physiology, gene addition can also be used. The addition of reactions
(corresponding to the added genes) to the network expands the wild-type
solution space. Gene addition studies have been evaluated for applications such as
increasing the theoretical yield of amino acids in the E. coli metabolic network (Burgard and
Maranas 2001; Pharkya and Maranas 2006). The results from a wild-type strain were
compared with those of a strain that has access to 3400 additional reactions available in other
species. An increase in amino acid production was achieved by the addition of only one
or two genes (reactions).
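The effect of gene addition on the solution space can be illustrated by appending candidate reaction columns to the stoichiometric matrix and recomputing the maximum product flux. The sketch below is a simple one-at-a-time screen on a hypothetical two-metabolite network; it is not the optimization procedure of the cited studies, and the candidate reactions stand in for a database of reactions from other species.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical native network: uptake -> A; 2 A -> P; P -> (export).
S_native = np.array([
    [1, -2,  0],   # A
    [0,  1, -1],   # P
], dtype=float)
native_bounds = [(0, 10), (0, 1000), (0, 1000)]
EXPORT = 2   # column index of the product export reaction

# Candidate heterologous reactions, given as stoichiometric columns (A, P).
candidates = {
    "het_direct": np.array([-1.0, 1.0]),    # A -> P, a more efficient route
    "het_reverse": np.array([1.0, -1.0]),   # P -> A, offers no benefit
}


def max_export(S, bounds):
    """Maximum product export flux subject to S v = 0 and the bounds."""
    c = np.zeros(S.shape[1])
    c[EXPORT] = -1.0   # linprog minimizes
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    return -res.fun if res.success else 0.0


baseline = max_export(S_native, native_bounds)
gains = {}
for name, column in candidates.items():
    # Adding a reaction appends one column to S and one flux variable.
    S_augmented = np.column_stack([S_native, column])
    gains[name] = max_export(S_augmented, native_bounds + [(0, 1000)]) - baseline

print("baseline maximum export:", round(baseline, 6))
for name, gain in gains.items():
    print(f"  add {name}: gain {gain:.3f}")
```

Since every added column can carry zero flux, the augmented network can never do worse than the native one, which is the expansion of the solution space described above.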
Extreme pathways: In metabolic networks, the solution space is bounded by a set of
unique basis pathways called extreme pathways, and all possible flux distributions can be
described as non-negative linear combinations of these pathways. The cellular functions of a biological