ORTHOGONAL PARTIAL LEAST SQUARES
DISCRIMINANT ANALYSIS IN METABOLOMICS
FOR KIDNEY AND CATARACT DISEASE
Declaration
I hereby declare that this thesis is my original work and it has been written by me in its entirety.
I have duly acknowledged all the sources of information which have been used in the thesis.
This thesis has also not been submitted for any degree in any university previously.
Chew Ai Ping
25 June 2012
Acknowledgements
It is my honour to thank the following who have made this thesis possible.
Firstly, I thank Professor Sam Li, my main supervisor, for the support and patient guidance these few years, from the start of the project to the end of the write-up for this thesis.
I also thank Dr Ong Eng Shi for being the co-supervisor for this project, and for starting me on this project with the kind and thoughtful help in obtaining and running the samples.
I also thank Professor Ong Choon Nam, NUS, for kindly agreeing to release the samples, and for his prompt replies to my questions and offering assistance in any way possible.
I thank my lab mates for their support in my studies, research, and also for giving valuable advice where needed. They are Drs Lau Hiu Fung, Law Wai Siang, Tok Junie, Zuo Xinbing, Wu Huanan, Liu Feng, Grace Birungi, Jiang Zhangjian, Ms Elaine Tay, Ms Fang Guihua, Ms Gan Peipei, Ms Lü Min, Ms Huang Yan, Mr Jon Ashley, Mr Chen Baisheng, and Mr Lin Junyu. I also thank Mr Ting Aik Leong, whose help in running the samples has also made this thesis possible.
I thank the National University of Singapore for giving me the financial support and the chance to take up this degree under the Research Scholarship programme.
I sincerely thank the pastors, full-time staff, elders, leaders, and brothers and sisters of the Tabernacle Church and Missions, Singapore, for loving, teaching, guiding, and spurring me towards completing my thesis.
I sincerely thank my family for their unfailing support and love given these few years while I undertook studies for my Master’s degree.
Finally, all thanks and glory be to God, who has made all things possible through Him and in Him.
Table of Contents
Declaration
Acknowledgements
Table of Contents
Summary
List of Tables
List of Figures
List of Abbreviations
List of Symbols
Chapter 1 Introduction
1.1 Metabolomics
1.1.1 Overview
1.1.2 Metabolomics in Disease Diagnosis
1.1.3 Non-targeted and Targeted Approaches in Metabolomics
1.1.4 Using Urine for Metabolomic Analysis
1.2 Analytical and Separation Techniques in Metabolomics
1.2.1 Nuclear Magnetic Resonance
1.2.2 Mass Spectrometric Techniques in Metabolomics
1.2.3 Separation Techniques in Metabolomics
1.2.3.1 Overview
1.2.3.2 Gas Chromatography
1.2.3.3 High Performance Liquid Chromatography
1.3 Chemometrics in Metabolomics
1.3.1 Overview
1.3.2 Principal Component Analysis
1.3.3 Partial Least Squares/Projection to Latent Structures
1.3.4 Orthogonal Partial Least Squares Discriminant Analysis
1.3.5 Pre-treatment of Data for Chemometric Analysis
1.4 Chronic Kidney Disease
1.4.1 Overview of Chronic Kidney Disease
1.4.2 Diagnosis of Chronic Kidney Disease
1.4.3 Metabolomics and Chemometrics for Chronic Kidney Disease
1.5 Cataract Disease
1.5.1 Overview of Cataract Disease
1.5.2 Diagnosis of Cataract Disease
1.5.3 Metabolomics and Chemometrics for Cataract Disease
1.6 Approach and Scope of Study
Chapter 2 Materials and Methods
2.1 Materials
2.2 Urine Sample Collection
2.3 Equipment and Procedure for HPLC-MS/MS
2.4 Extraction and Normalization of Chromatogram Peak Areas
2.5 Chemometric Analysis
2.6 Statistical Analysis
Chapter 3 Results and Discussion for Chronic Kidney Disease
3.1 Results for Chronic Kidney Disease
3.1.1 Results for Control vs Chronic Kidney Disease ESI+ Dataset
3.1.2 Results for Control vs Chronic Kidney Disease ESI- Dataset
3.1.3 Results for Combined ESI+ and ESI- Dataset
3.2 Discussion for Chronic Kidney Disease
3.3 Summary
Chapter 4 Results and Discussion for Cataract Disease
4.1 Results for Cataract Disease
4.1.1 Results for Control vs Cataract Disease ESI+ Dataset
4.1.2 Results for Control vs Cataract Disease ESI- Dataset
4.1.3 Results for Combined ESI+ and ESI- Dataset
4.2 Discussion for Cataract Disease
4.3 Summary
Chapter 5 Conclusion and Future Work
References
Summary
This thesis shows how metabolomics and multivariate statistical methods such as Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) can be used to study and enhance understanding of two diseases.
The study utilizes univariate and multivariate statistical techniques to determine the differences in a targeted set of metabolites for healthy controls and two groups of diseased persons. Urine samples were collected from healthy controls and patients suffering from chronic kidney disease (CKD). High performance liquid chromatography-tandem mass spectrometry analysis was performed on each sample, and chromatographic and mass spectrometric data were obtained. After pre-treatment of the data through normalization and scaling, principal component analysis and OPLS-DA were used to visualize the differences in these two classes. Further statistical analysis was employed to determine fluctuations in target metabolites to understand disease pathology, and also to identify potential biomarker candidates for CKD. This same method was also employed for a separate group of patients suffering from cataract disease for further validation.
The thesis is then concluded with a summary of the main findings, a discussion on the challenges faced, and suggestions for future work in metabolomic studies of CKD and cataract disease.
List of Tables
Table 1 Description of stages of chronic kidney disease (adapted from [50, 117, 128], originally from [131])
Table 2 Metabolites identified in human urine samples for Controls and CKD patients in ESI+ mode
Table 3 Metabolites identified in human urine samples for Controls and CKD patients in ESI- mode
Table 4 Metabolites identified in human urine samples for Controls and cataract disease patients in ESI+ mode
Table 5 Metabolites identified in human urine samples for Controls and cataract disease patients in ESI- mode
List of Figures
Figure 1 Representative TICs (ESI+) of (A) Control, and (B) Patient with CKD.
Figure 2 (A) PCA scores plot for Control ESI+ data; (B) DModX scores plot for Control ESI+ data; (C) PCA scores plot for CKD ESI+ data.
Figure 3 PCA scores plot for Control and CKD ESI+ dataset.
Figure 4 OPLS-DA scores plot for Control against CKD ESI+ dataset.
Figure 5 Cross-validation scores plot for Control and CKD ESI+ dataset.
Figure 6 Random permutation test scores plot for Control and CKD ESI+ dataset.
Figure 7 (A) VIP and (B) Loadings plot for Control-CKD ESI+ dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 8 Representative TICs (ESI-) of (A) Control, and (B) Patient with Chronic Kidney Disease.
Figure 9 (A) PCA scores plot for Control ESI- data; (B) DModX scores plot for Control ESI- data; (C) PCA scores plot for CKD ESI- data.
Figure 10 PCA scores plot for Control and CKD ESI- dataset.
Figure 11 OPLS-DA scores plot for Control against CKD ESI- dataset.
Figure 12 Cross-validation scores plot for Control and CKD ESI- dataset.
Figure 13 Random permutation test scores plot for Control and CKD ESI- dataset.
Figure 14 (A) VIP and (B) Loadings plot for Control-CKD ESI- dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 15 (A) PCA scores plot for Control combined ESI+/ESI- data; (B) DModX scores plot for Control combined ESI+/ESI- data; (C) PCA scores plot for CKD combined ESI+/ESI- data.
Figure 16 PCA scores plot for Control and CKD combined dataset.
Figure 17 OPLS-DA scores plot for Control against CKD combined dataset.
Figure 18 Cross-validation scores plot for Control and CKD combined dataset.
Figure 19 Random permutation test scores plot for Control and CKD combined dataset.
Figure 20 (A) VIP and (B) Loadings plot for Control and CKD combined dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 21 Representative TICs (ESI+) of (A) Healthy Control, and (B) Patient with Cataract Disease.
Figure 22 (A) PCA scores plot and (B) DModX scores plot for Cataract Disease ESI+ data.
Figure 23 PCA scores plot for Control and Cataract Disease ESI+ dataset.
Figure 24 OPLS-DA scores plot for Control against Cataract Disease ESI+ dataset.
Figure 25 Cross-validation scores plot for Control and Cataract Disease ESI+ dataset.
Figure 26 Random permutation test scores plot for Control and Cataract Disease ESI+ dataset.
Figure 27 (A) VIP and (B) Loadings plot for Control-Cataract Disease ESI+ dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 28 Representative TICs (ESI-) of (A) Healthy Control, and (B) Patient with Cataract Disease.
Figure 29 (A) PCA scores plot for Cataract Disease ESI- data; (B) DModX scores plot for Cataract Disease ESI- dataset.
Figure 30 PCA scores plot for Control and Cataract Disease ESI- dataset.
Figure 31 OPLS-DA scores plot for Control against Cataract Disease ESI- dataset.
Figure 32 Cross-validation scores plot for Control and Cataract Disease ESI- dataset.
Figure 33 Random permutation test scores plot for Control and Cataract Disease ESI- dataset.
Figure 34 (A) VIP and (B) Loadings plot for Control-Cataract Disease ESI- dataset. Interval bars denote the jack-knife confidence intervals for each metabolite.
Figure 35 (A) PCA scores plot and (B) DModX scores plot for Cataract Disease combined ESI+/ESI- data.
Figure 36 PCA scores plot for Control and Cataract Disease combined ESI+/ESI- dataset.
Figure 37 OPLS-DA scores plot for Control and Cataract Disease combined
Figure 40 (A) VIP and (B) Loadings plot for Control and Cataract Disease combined datasets. Interval bars denote the jack-knife confidence intervals for each metabolite.
List of Abbreviations
least squares
List of Symbols
the model
variation in the Y matrix, representing the predictive ability of
the model
the total explained variance in the model
cumulative modelled variation in the Y matrix, and
representing the goodness of fit of the model in explaining the variation by the components in the model
OPLS-DA model, also representing within class variation
representing between-class variation
of such small molecules include simple sugars, fatty acid amides, and amino acids. The presence and quantity of these metabolites reflect what goes on within and outside the cell. The goal of metabolomics is not only to determine the disease pathology, the role of metabolites in the biochemistry of the organism, and potential biomarkers, but also ultimately to determine the molecular structure of these biomarkers [3]. Overall, metabolomic studies greatly aid our understanding of the biology of an organism at a systems level.
Nicholson et al., in their landmark paper, defined metabolomics as the study of the “dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification” [4], helping us to understand how living systems actually work. Specifically, metabolomics as a discipline becomes useful when organisms are under a state of stress as a result of disease (“pathophysiological stimuli”) or perturbations to the genetic content of the cells (“genetic modification”) [5]. Knowledge of the cellular response under such conditions helps researchers identify potential therapeutic targets. In this manner, therapy does not end with symptomatic treatment that merely addresses the metabolic flux, but has the end-goal of a total cure in mind.
Metabolomics, or metabonomics, has been gaining new ground in the field of systems-level “omics” research [6], i.e. genomics, transcriptomics, and proteomics. It is a relatively new area of study compared to its sister disciplines [7] and complements these fields [8]. The advantage of metabolomics over its counterparts is that the metabolome is much more closely correlated to the actual cellular response than the genome or proteome [3, 7, 8]. Also, the amount of data generated is smaller because there are fewer metabolites than genes or proteins [2, 7]. In addition, each metabolite may be involved in one or several pathways, contributing to the complex expressed phenotype of the organism. It is this set of downstream biochemical pathways, and not just a single pathway, that metabolomics aims to map out [2, 8]. This allows us to obtain a more timely and accurate understanding of cellular and systemic processes.
1.1.2 Metabolomics in Disease Diagnosis
Metabolomics is increasingly becoming a valuable tool to study disease pathology, and to screen for, diagnose, and determine the effect of treatment on diseases. A wide variety of diseases is being studied, with the analytical methods used varying with the disease and the aim of the study. For example, Jung et al. have successfully used proton NMR (¹H-NMR) and targeted metabolic profiling with multivariate analysis to distinguish patients with cerebral infarctions from healthy controls by analysis of urine and plasma [9]. Also, Kim et al. have combined toxicology with metabolomics to determine urinary biomarkers for human gastric cancer using a mouse model [10]. In a recent study, Bao et al. have also devised a novel method of measuring the systemic effects of various drug treatments on type 2 diabetes mellitus (T2DM) instead of just obtaining the conventional glucose measurement for T2DM [11]. Further, in an attempt to obtain a comprehensive understanding of and to diagnose renal cell carcinoma, Kind et al. have successfully utilized various separation techniques coupled with mass spectrometry and subsequent multivariate analysis to analyze and discriminate patient urine from healthy controls in a small pilot study [3]. Given the increasing number of parameters in metabolomic analysis, there is an even greater need for reliable and informative multivariate techniques to analyse these data.
The combination of multivariate statistical tools with metabolomics has been shown to be powerful for disease screening involving non-targeted determinations. One such study of interest is that by Michell et al. In their metabolomic analysis of Parkinson’s disease patient serum and urine samples, they were able to separate female Parkinson’s patients from their age-matched controls using partial least squares discriminant analysis (PLS-DA) based on the urine data, despite not finding strong individual biomarkers responsible for this separation. They surmise that there is a unique metabolic pattern of Parkinson’s disease contributed by certain metabolites [12]. Also, in a separate study, Kemperman et al. observe that while multivariate statistical analysis was able to show discriminatory peptide peaks, univariate analysis failed to show these as discriminatory due to “a very large biological variation among the proteinuric patient group” [13]. These studies show the necessity of multivariate techniques in view of the nature of the samples and data obtained.
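The point that a pattern across several metabolites can discriminate where no single metabolite does can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the two-metabolite intensities are invented, not taken from the cited studies, and the projection weights are chosen by hand, whereas a method such as PLS-DA would estimate them from the data.

```python
# Toy data (invented for illustration): two metabolite intensities per sample.
controls = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
patients = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]


def ranges_overlap(a, b):
    """True if the value ranges of two groups overlap (no clean univariate cut-off)."""
    return min(a) <= max(b) and min(b) <= max(a)


def project(samples, weights):
    """Score each sample as a weighted sum of its variables (a linear projection)."""
    return [sum(w * x for w, x in zip(weights, s)) for s in samples]


# Univariate view: look at each metabolite on its own.
overlap_per_variable = [
    ranges_overlap([s[j] for s in controls], [s[j] for s in patients])
    for j in range(2)
]

# Multivariate view: the contrast "metabolite 2 minus metabolite 1".
# These weights are hand-picked for the toy data (an assumption of this sketch).
weights = (-1.0, 1.0)
control_scores = project(controls, weights)
patient_scores = project(patients, weights)
separated = max(control_scores) < min(patient_scores)

print(overlap_per_variable)      # [True, True]: neither metabolite separates the groups alone
print(control_scores)            # [0.0, 0.0, 0.0, 0.0]
print(patient_scores)            # [1.0, 1.0, 1.0, 1.0]
print(separated)                 # True: the combination separates them
```

In this toy case the between-sample variation in each metabolite swamps the class difference, which only becomes visible once the two variables are considered jointly.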
1.1.3 Non-targeted and Targeted Approaches in Metabolomics
There are two general approaches towards metabolomic studies: non-targeted and targeted. Non-targeted or global profiling approaches in metabolomics aim to capture as many features of an organism’s metabolic profile as possible. This approach allows researchers to obtain a holistic picture of the types and concentrations (relative or absolute) of the metabolites, so that comparisons can be made between study groups in order to determine patterns of changes which are useful for diagnosis [14].
Non-targeted approaches such as metabolic fingerprinting may not identify the specific metabolites involved in disease pathology, but consider the combination of analytes and their concentrations in totality [15]. This approach allows for the “simultaneous analysis of multiple end products”, allowing for a “more powerful and robust means by which to stratify disease severity, progression and to assess drug efficacy than the analysis of any single marker over a patient population” [16]. For example, Vallejo et al. have used capillary electrophoresis coupled with ultraviolet detection and subsequent metabolic fingerprinting to distinguish between normal rats and diabetic rats on antioxidant treatment [17]. Issaq et al. have also successfully utilized metabolomic profiling with high performance liquid chromatography-mass spectrometry (HPLC-MS) to detect bladder cancer using urine samples in their proof-of-concept study. Their study does not use the traditional techniques, which are either less sensitive towards low-grade tumours (i.e. urine cytology) or more invasive in terms of methodology (i.e. cystoscopy) [5]. Novel biomarkers may also be identified in non-targeted approaches, e.g. by structural studies through NMR or tandem mass spectrometry.
Given the knowledge of metabolites and their interactions in specific biochemical pathways, one can also capitalise on targeted approaches to study specific metabolites or groups of metabolites [18] using reference spectra for analysis [19]. The time needed for post-acquisition data processing and identification of metabolites is shorter as well [19]. Metabolite and pathway databases and search engines such as the Human Metabolome Database [20], the Kyoto Encyclopedia of Genes and Genomes database [21, 22], and the METLIN Metabolite Database [23] are useful resources in this area of pathway analysis. Researchers can also make use of targeted analysis to determine how the concentrations of particular metabolites in a system vary with concentration changes of other metabolites. For example, Grison et al. have successfully used targeted profiling to determine a metabolic signature for chronic caesium exposure [24]. Also, Wu et al. have compared the metabolite profiles of salt-tolerant and salt-intolerant soybean plants and, through multivariate analysis, have found that secondary metabolites such as isoflavones and saponins distinguished these two varieties [25]. One limitation of the targeted approach is that, since only known metabolites can be identified and quantified, it is not possible to discover novel compounds as biomarkers [18]. Yet the numerous successes using this approach show that there is a need and use for such targeted studies.
1.1.4 Using Urine for Metabolomic Analysis
While many types of body fluids (biofluids) have been used for metabolomic studies, the choice of biofluid is highly dependent on the disease being studied. The choices of biofluid include blood serum [12, 26-29], plasma [27, 30-32], cerebrospinal fluid [33], urine [3, 5, 7, 12, 29, 34-43], saliva [44, 45], tears [46], and even vitreous humour [15]. Urine has the advantage of being easily obtained in large enough volumes for multiple analyses [1, 18, 47]. It is also one of the least invasive body fluids to collect from patients [10], allowing for multiple collections at different times [18] with minimal discomfort to study subjects [1]. Furthermore, urine is the biofluid through which the majority of metabolic waste products are excreted from most organ systems in the body, and it can therefore provide much information about the body’s biochemical processes as a whole system [3], since it is not subject to strict homeostatic regulation as serum is [48]. In addition, obtaining or preparing urine samples is usually more straightforward than for other biofluids such as blood [49], serum [18], plasma [1], or tears.
In addition, the concentrations of metabolites are often higher in urine [47], which makes their detection and determination easier. It has also been found that in the study of renal diseases, measurements of kidney function are generally more accurate when using urine measurements than plasma, provided sufficient and accurate volumes of urine samples can be obtained [50]. Metabolic changes that take place at the cellular level are easily reflected in the urine because, other than blood, it is the biofluid to which most of the kidney is exposed [2, 3].
Urine as a biofluid for analysis, however, also has its disadvantages. There may be large variations in terms of volume and therefore the degree of dilution of metabolites [17], resulting in a very wide dynamic range [1] and concentration differences of five-thousand fold or more [17, 51]. These differences represent natural variation, and may be exacerbated under conditions of disease [1]. In addition, as with other body fluids, the concentrations of metabolites may not correspond to their importance in disease pathology [17]. Also, xenobiotics may be present [1], and these may or may not be directly related to the organism’s core metabolism; if so, they may provide valuable information on the varied interactions of the organism with its environment. The analysis method chosen must therefore be able to deal with these problems associated with metabolic studies involving urine, in addition to being reliable and reproducible [14].
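One common way of compensating for differences in urine dilution is to express each peak area as a fraction of the sample's total peak area. The sketch below is a minimal illustration of that idea only; it assumes total-area normalization, which is not necessarily the procedure used in this thesis, and the peak areas, the dilution factor, and the function name `normalize_to_total` are invented for the example.

```python
import math


def normalize_to_total(peak_areas):
    """Divide each peak area by the sample's summed peak area."""
    total = sum(peak_areas)
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return [area / total for area in peak_areas]


# Hypothetical peak areas for three metabolites in a concentrated urine sample.
concentrated = [1000.0, 3000.0, 6000.0]

# The same sample diluted five-thousand fold (the order of magnitude noted above).
dilution_factor = 5000.0
dilute = [area / dilution_factor for area in concentrated]

profile_concentrated = normalize_to_total(concentrated)
profile_dilute = normalize_to_total(dilute)

# After normalization the two relative profiles agree to within rounding error.
same_profile = all(
    math.isclose(a, b, rel_tol=1e-9)
    for a, b in zip(profile_concentrated, profile_dilute)
)
print(same_profile)
```

Normalization of this kind removes a common scaling factor per sample; it does not address the other issues raised above, such as xenobiotics or metabolites whose concentration is unrelated to their importance in pathology.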
Despite these limitations in terms of variation of urine volume, metabolite concentration differences, and the presence of xenobiotics, urine has been one of the choice candidates for metabolomics studies. The advantages of using urine for the current study far outweigh the disadvantages, as will be discussed further in the following sections, and urine is therefore the body fluid chosen for this study of chronic kidney disease (CKD).
1.2 Analytical and Separation Techniques in Metabolomics
1.2.1 Nuclear Magnetic Resonance
The expanding area of research in metabolomics can also be attributed to improvements in technologies that allow sensitive, specific, and reproducible studies to be carried out. Traditionally, NMR was, and still is, a major analysis technique employed for mapping the metabolome [18, 52]. The main advantages that NMR affords are its reproducibility [53] over different runs and across different instruments [2] and its ability to detect a wide range of metabolites [19], allowing for the building of compound libraries [46]. Furthermore, sample preparation is usually minimal [2, 19] and non-destructive [8], and analysis times are short as well [8]. NMR is also able to analyse intact tissue through high-resolution magic-angle spinning [54].
In addition, NMR allows for the molecular structures of biomarkers to be discerned in two-dimensional structural studies [55]. It allows researchers to determine metabolite profile patterns through metabolic fingerprinting to classify groups of subjects without actual identification of the molecules involved [53]. For example, Brindle et al. have used ¹H-NMR to successfully profile human serum for the accurate diagnosis of coronary heart disease [56], while Keun et al. have successfully used ¹³C-NMR to investigate urine in metabolomic studies [57]. Further, Kang et al. have also successfully used NMR with orthogonal partial least squares discriminant analysis (OPLS-DA), a multivariate statistical tool, to discriminate between Korean and Chinese herbal medicines [58]. As the number of variables being analysed increases, it is apparent that multivariate tools become necessary in order to obtain a more complete understanding of the systems being studied. These multivariate tools must also allow for a logical and systematic way of handling the information obtained. It is with this thought in mind that multivariate statistical techniques feature in our study; they will be further reviewed in this chapter.
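To make the idea of a multivariate projection concrete, the sketch below scales a small table of peak areas and extracts the first principal component. It is a minimal illustration under stated assumptions, not the procedure used in this thesis: the peak areas are invented, unit-variance scaling is assumed as the pre-treatment, the function names are hypothetical, and the component is found by plain power iteration rather than by dedicated chemometrics software.

```python
import math


def autoscale(rows):
    """Mean-centre each column and divide by its sample standard deviation.

    Assumes every column has non-zero variance.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def first_component(x, iterations=200):
    """Power iteration on X^T X: returns the first loading vector and the scores."""
    p = len(x[0])
    w = [1.0] + [0.0] * (p - 1)  # arbitrary starting direction
    for _ in range(iterations):
        t = [sum(xij * wj for xij, wj in zip(row, w)) for row in x]    # t = X w
        w = [sum(ti * row[j] for ti, row in zip(t, x)) for j in range(p)]  # w = X^T t
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    t = [sum(xij * wj for xij, wj in zip(row, w)) for row in x]
    return w, t


# Hypothetical peak areas: rows are samples (two controls, then two patients),
# columns are three metabolites.
peak_areas = [
    [10.0, 200.0, 5.0],
    [12.0, 220.0, 6.0],
    [30.0, 100.0, 5.5],
    [33.0, 90.0, 4.5],
]

scaled = autoscale(peak_areas)
loading, scores = first_component(scaled)

for label, score in zip(["control", "control", "patient", "patient"], scores):
    print(label, round(score, 2))
```

In this invented example the two controls fall on one side of the first component and the two patients on the other, which is the kind of grouping a scores plot is meant to reveal. The overall sign of a principal component is arbitrary, so only the relative positions of the samples carry meaning.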
However, a main drawback of NMR is its inherent lack of analytical sensitivity [2, 8, 46], which results in the inability to detect metabolites at concentrations lower than 5 µM [18]. Spin-spin coupling also complicates data interpretation [19]. Several recent advances in NMR technology include microprobes and miniature probe coils for smaller volumes of sample [53] and cryoprobes for better sensitivity and shorter acquisition times [53, 59]. However, high-throughput profiling does not seem possible unless the issues of complicated spectra and difficult compound identification are resolved [19]. In addition, the high cost and space requirements of the equipment [48] may also mean that not all laboratories will be appropriately equipped. Therefore, while NMR is a very useful and powerful technology for metabolite profiling, it does not allow for high-throughput studies on metabolites of very low concentrations. In view of these considerations, other analytical methods such as mass spectrometry and chromatographic separation techniques need to be considered.
1.2.2 Mass Spectrometric Techniques in Metabolomics
MS is a useful and often necessary tool for the identification and quantification of metabolites in metabolomics investigations [2] through their molecular fragmentation patterns [18]. It also has a higher sample salt tolerance than NMR and has improved to reach picomole detection levels [2]. In addition, the inherently higher sensitivity of MS compared with NMR means that it is usually used for targeted studies [8]. It must, however, be noted that MS and NMR techniques are complementary, as each has its own limitations and advantages.
For MS, the choice of ionization technique is very important, as it needs to be suitable for the type of metabolites under study. Commonly used ionization techniques include electron ionization (EI), chemical ionization (CI), and atmospheric pressure ionization. EI is the most commonly used ionization technique with gas chromatography (GC), but suffers from the lack of molecular ions for some compounds [60]. A common alternative is therefore CI, which helps to produce the molecular ions for compounds that do not do so with EI [60]. The most commonly used atmospheric pressure ionization technique, electrospray ionization (ESI) [61], is suitable for the metabolomic analysis of urine as the metabolites are usually polar or ionic. The lack of extensive fragmentation also means that molecular ions can be detected with high sensitivity [1]. ESI also has both positive- and negative-ion modes, allowing for wider coverage of metabolites [1]. ESI is therefore the ionization technique chosen for this study.
The choice of mass analyser is also dependent on the type of metabolic analysis being carried out. If “global, untargeted metabolic profiling studies” are to be carried out, high-resolution mass analysers such as time-of-flight (TOF) and quadrupole-TOF (Q-TOF) instruments are suitable for resolution of “co-eluting metabolites having the same nominal mass” [1]. On the other hand, if targeted studies on a select group of metabolites are to be carried out, low-resolution mass analysers such as single quadrupole, triple quadrupole, and ion trap instruments are sufficient for detection and quantification of the metabolites being investigated [1]. Triple quadrupole instruments are usually chosen over single quadrupole mass analysers due to the former’s higher sensitivity and ability to perform selective reaction monitoring [62]. As this study utilises a targeted approach, a triple quadrupole mass analyser would be sufficient for detection and quantification of the metabolites under study.
As the composition of biofluids is highly complex, it is advantageous for metabolomic studies to utilise separation techniques prior to MS analysis. Direct infusion techniques have been used in metabolomics to determine metabolic profiles in both plants [63] and animals with high sensitivity and selectivity [53], but the quality of chromatograms is usually adversely affected by matrix effects [18, 46], incomplete ionization [18], or ion suppression or enhancement [64]. Although direct injection techniques such as desorption ESI and extractive ESI could be used to counter matrix effects encountered in urine analysis [18], coupling MS with an orthogonal separation method allows for a more complete and accurate measurement of the metabolites present [1]. In light of these considerations and the complexity of the samples obtained, it was decided that a separation method prior to ionization was necessary to improve the quality of data obtained.
1.2.3 Separation Techniques in Metabolomics
1.2.3.1 Overview
The advancement of metabolomics has also been largely supported by developments in separation and mass spectrometric technologies. As the number of variables under investigation increases, there is a need to distinguish and analyse each metabolite separately so that a more accurate understanding of the condition being studied can be obtained [2]. Separation technologies include chromatographic methods such as HPLC, gas chromatography (GC), and capillary electrophoresis. Other extraction methods have also earned favour in metabolomics, and these include solid phase extraction [65]. The use of GC and HPLC in metabolomics will be discussed in more detail in this sub-section.
1.2.3.2 Gas Chromatography
GC, specifically capillary GC, has been widely used in the field of metabolomics in conjunction with mass spectrometry due to the reproducible, high-quality spectra obtained using this method [66]. As Kind et al. note, GC-MS has been extensively used since the 1970s [3]. Compound libraries for reference can be compiled in-house, obtained from commercial sources, or imported from other external sources such as the National Institute of Standards and Technology [61]. These extensive libraries make GC-MS a tool of choice for identification of metabolites [46].
GC-MS has been widely used in metabolomic studies involving multivariate determination of the diseased state through analysis of urine. Halket et al. have explored a method of determining urinary organic acids using GC-MS with pattern recognition techniques to identify metabolic disorders [67]. Zhang et al. have also successfully used multivariate OPLS-DA modelling to determine 40 differentiating metabolites for osteosarcoma in GC-MS analysis of serum and urine, as well as discern the energy metabolism disruptions through their targeted analysis [29]. More recently, Pasikanti et al. have devised and validated a method where GC-MS was coupled with principal component analysis (PCA) and OPLS-DA to differentiate between genders based on a global metabolomic analysis of urine samples [49]. These studies show the utility of GC-MS coupled with multivariate statistical tools in metabolomic studies.
However, for GC, sample derivatization is necessary to obtain volatile forms of the non-volatile analytes [18]. In the study by Pasikanti et al., sample derivatization using BSTFA and the presence of co-eluting compounds made it difficult to identify metabolites, especially those of low abundance [18]. Sample pre-treatment to remove interfering molecules and to enhance the concentration of desired metabolites is therefore one main limitation associated with techniques such as GC-MS [18] as well as HPLC-MS. Many metabolites in urine are non-volatile and polar or ionic, and tedious sample derivatization is required prior to GC analysis [1] in order to decrease their polarity and increase volatility [66]. Thermal degradation of metabolites may also occur due to the high temperatures utilised in GC [46], further warranting the need for sample derivatization [66]. In the case of urine, urease treatment is necessary to protect the column and enhance the quality of spectra obtained [18, 66]. These preparation steps complicate and lengthen the total amount of time needed for analysis, and unwanted artefacts may also be introduced.
1.2.3.3 High Performance Liquid Chromatography
Like GC, high performance liquid chromatography (HPLC) has been extensively used in research, as reviewed by Kind et al. [3]. As with GC, HPLC is frequently coupled with MS in analysis. The combination of orthogonal separation techniques improves separation and identification of metabolites [3].
Once thought of as only a potentially powerful tool for metabolomics [68], HPLC-MS has proven to be very useful in this field. For example, Jia et al. have used HPLC-MS to determine the plasma phospholipid profiles of mice with Immunoglobulin A nephropathy, which is “the most common form of glomerulonephritis” [69]. They found that the combination of HPLC-MS with multivariate modelling by PCA and partial least-squares discriminant analysis (PLS-DA) is successful in differentiating healthy mice from their diseased counterparts and identifying relevant biomarkers [69]. The inherent sensitivity, specificity and efficiency of MS [69] coupled with the high peak capacity of HPLC have made it possible to accurately determine large numbers of metabolites in a short length of time [46]. In other works, Plumb et al. have successfully used HPLC-MS to screen rat urine in drug development and detect drug metabolites in biological fluids [31, 70]. Idborg-Björkman et al. have also used HPLC-MS and two-way data analysis to screen for biomarkers in rat urine [71].
Similarly, Yang et al. have used HPLC-based metabonomics in the diagnosis of liver cancer to decrease the false-positive rate [72], and further built on this work by exploring strategies for HPLC-based metabonomics research [73]. Furthermore, Chen et al., in their targeted analysis of urine metabolites, have utilised Rapid Resolution Liquid Chromatography (RRLC) and multivariate analysis, and leveraged metabolic correlation networks to determine potential biomarkers of breast cancer and gain a greater understanding of the interactions between the putative biomarkers [74]. Yin et al. have also used a similar method to study liver cirrhosis and hepatocellular carcinoma [75]. HPLC has therefore proven to be a powerful tool for studies involving disease screening and diagnosis.
In addition, urine, which is the biofluid of choice for this study, is particularly suitable for analysis by reversed-phase HPLC-MS [1]. As mentioned previously, urine contains many dissolved non-volatile analytes with various degrees of polarity. Urine can be injected directly into the column in either a diluted or neat form [14]. Apart from the removal of particulates and appropriate dilution, there is minimal sample preparation for HPLC-MS analysis of the low molecular weight metabolites [1, 14]. Although compound identification is more difficult than in GC-MS due to the lack of standard reference spectra libraries [61], the use of reference standards coupled with metabolite databases containing reference spectra alleviates this difficulty.
Therefore, comparing GC-MS and HPLC-MS, it is felt that the use of HPLC-MS would be more advantageous due to the nature of the target metabolites for this study. There is no need for derivatization of analytes in HPLC-MS, unlike GC-MS [76], which reduces the likelihood of mistakes introduced in the sample preparation step and increases the likelihood of accurately identifying novel compounds [48]. Also, the overall duration of sample analysis and post-processing would be shorter as no time is needed for sample derivatization.
1.3 Chemometrics in Metabolomics
1.3.1 Overview
The trend towards having many variables (such as chromatographic and spectroscopic information) describing one observation (each sample) has been fuelled by the advances in the above-mentioned analysis technologies, as well as by the need to describe disease pathology not in terms of only one metabolite but as a combination of metabolite responses in various physiological states. Moreover, there is a large dynamic range of metabolite concentrations, and the most important metabolites may not be the most abundant ones [77]. Appropriate multivariate chemometrics tools therefore have to be employed in order to summarise, interpret, and visualise the wealth of data generated from metabolomic experiments [77].
Chemometrics techniques are useful tools to help the researcher understand the acquired data. The underlying principle of chemometrics techniques is that mathematical operations and transformations are used to determine whether there are underlying patterns or trends, known as latent variables, in multi- or megavariate data [78]. These latent variables are able to summarise the variation shown in the data, as the absolute data obtained are usually highly collinear [78]. In addition, these latent structures may or may not be the variables being measured themselves. Alternatively, a priori information about the sample can be used in the analysis, and the data summarised to show whether the measured variables are correlated with the known information. Multivariate analysis methods are therefore necessary as they can help researchers to determine underlying patterns across large sets of data, and also because disease causation is seldom due to a single metabolite, but is usually multifactorial in origin [2].
Many chemometrics techniques have been used in the field of metabolomics. These multi- and megavariate techniques can be broadly divided into two categories, namely supervised and unsupervised methods. They include hierarchical classification analysis [79], PCA [80], linear discriminant analysis, PLS methods [81], OPLS [82], soft independent modelling of class analogy [83], support vector machines [84] and artificial neural networks (ANNs) [85]. The tool of choice is largely dependent on two factors: the nature of the data obtained, and the nature of the information that is needed about the data. Given these tools, the researcher must still make wise decisions based on careful experimental and study designs, and the researcher’s assumptions will also affect the kind of data-processing performed using these chemometrics tools.
1.3.2 Principal Component Analysis
The tool of choice for preliminary visualisation of underlying trends in data is PCA [86]. PCA has been described as the “workhorse of chemometrics” [87]. Indeed, PCA is useful because it not only allows researchers to visualise data in a reduced dimensionality, but also reveals any groupings or clustering in the samples observed, representing similarities between samples [2, 34]. This explains its use in what is known as exploratory data analysis [86], where it gives the researcher an overview of the data [17].
In addition, PCA is also able to show differences in data by calculation of orthogonal components to maximise the variance [5], represented as separation between different groups or clusters along the orthogonal components [2, 5]. Also, outliers are easily recognised in the scores plots, and these inform researchers if there is a need to remove these samples [12, 86].
If classification information about the samples is unavailable, PCA can be used to show the trends and patterns that allow us to classify the samples accordingly, and carry out further analysis. Examination of the loadings plots will usually reveal the variable or variables most responsible for the groupings observed [48]. PCA is thus an unsupervised technique, as class information is not used in the data analysis, unlike in supervised models. PCA can also show whether such between-class variations are significant enough to outweigh the within-class variation.
However, as stated, PCA is but a preliminary visualisation method. Being an unsupervised technique, it may fail to separate samples when chemical noise and other irrelevant and distracting sources of variation, such as instrumental drift and artefacts, are large [3]. There may also be cases where the intra- or inter-subject variation is too large for clustering to take place. There is therefore a need for supervised chemometric tools which can help researchers focus on the relevant sources of variation that are being studied [77].
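The projection described in this section can be illustrated with a minimal sketch. It assumes mean-centred data and uses the singular value decomposition from NumPy; the function name pca_scores and the toy data matrix are hypothetical and are not taken from this study or from the software used in it.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return (scores, loadings, explained variance ratio) for data matrix X.

    Rows of X are observations (samples); columns are variables (e.g. peaks).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # T = U S, one row per sample
    loadings = Vt[:n_components].T                     # P, one column per component
    explained = (s ** 2) / np.sum(s ** 2)              # fraction of variance per component
    return scores, loadings, explained[:n_components]

# Hypothetical toy data: 6 samples x 3 variables forming two loose groups.
X = np.array([
    [1.0, 2.0, 0.5],
    [1.2, 1.9, 0.4],
    [0.9, 2.1, 0.6],
    [3.0, 0.5, 2.0],
    [3.1, 0.4, 2.2],
    [2.9, 0.6, 1.9],
])
scores, loadings, explained = pca_scores(X)
```

In this toy case the first column of scores would be plotted against the second to give a scores plot, and the columns of loadings indicate which variables drive each component.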
1.3.3 Partial Least Squares/ Projection to Latent Structures
PLS, on the other hand, is an example of a supervised classification tool which utilises known class information in data analysis. PLS is a powerful form of multivariate analysis as it is able to handle data which are “strongly collinear, noisy, and [contain] numerous X-variables” [88]. For metabolomics studies, there are often two or more groups of samples being studied; these could be control and diseased, different phases of disease, or different types of treatment. These dependent variables are collectively termed the Y matrix, and may be discrete or continuous [77]. Discriminant analysis may also be applied where the Y variables consist of variables denoting group belonging. PLS when combined with discriminant analysis (PLS-DA) maximises class separation and builds prediction models based on the information given [2, 17].
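To make the idea of Y variables denoting group belonging concrete, the following is a small sketch of how class labels could be turned into a dummy Y matrix. The function name dummy_y and the labels are hypothetical and serve only as an illustration.

```python
def dummy_y(class_labels):
    """Build a dummy Y matrix: one column per class, where 1 marks membership."""
    classes = sorted(set(class_labels))
    matrix = [[1 if label == c else 0 for c in classes] for label in class_labels]
    return classes, matrix

# Hypothetical class labels for four samples.
labels = ["control", "diseased", "control", "diseased"]
classes, Y = dummy_y(labels)
```

Each row of Y corresponds to one sample, and the column order follows the sorted class names returned in classes.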
1.3.4 Orthogonal Partial Least Squares Discriminant Analysis
OPLS is also a supervised classification tool and is a relatively new extension of PLS [82]. Where class information is known about the samples, OPLS becomes a very powerful dimension-reduction and visualisation tool. It affords better interpretability and transparency compared to PLS, as the PLS model is rotated [89] so that the variation in the data is separated into two components: those that are related to the Y matrix (and therefore class separation [77]), and those that are unrelated (orthogonal) to the Y matrix [77, 90-92]. Such separation of components is important, as it allows researchers to understand the main causes of variation that separate the classes of samples by relating it with known variables [34].
Like PLS, OPLS can also be used in conjunction with discriminant analysis in metabonomics studies [93]. In this case, the values in the descriptive Y matrix are dummy variables used solely to assign class belonging [94]. OPLS-DA therefore improves class separation, interpretation, and identification of the metabolite information [34, 91]. For example, Whelehan et al. have used it to detect ovarian cancer through analysis of the proteomic profiles of 191 subjects [91], while Qiu et al. have used it to diagnose human colorectal cancer [26]. These improvements therefore allow OPLS-DA to be used as a chemometric tool for disease diagnosis.
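The separation of Y-related and Y-orthogonal variation can be sketched for the simplest case of a single response vector and a single orthogonal component. This is an illustrative NumPy sketch of the general single-response OPLS scheme, under the assumption that both X and y are mean-centred; it is not the software used in this work, and the function name opls_one_component and the toy data are hypothetical.

```python
import numpy as np

def opls_one_component(X, y):
    """Remove one Y-orthogonal component from X (single-response sketch).

    X: (n_samples, n_variables) array; y: (n_samples,) response or dummy class vector.
    Returns the filtered X, the predictive scores and the orthogonal scores.
    Assumes the predictive loading is not exactly parallel to the weight vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)                  # mean-centre the variables
    yc = y - y.mean()                        # mean-centre the response

    w = Xc.T @ yc                            # predictive weights: covariance of X with y
    w /= np.linalg.norm(w)
    t = Xc @ w                               # predictive scores before filtering
    p = Xc.T @ t / (t @ t)                   # loadings belonging to those scores

    w_orth = p - (w @ p) * w                 # part of p not along w (w has unit norm)
    w_orth /= np.linalg.norm(w_orth)
    t_orth = Xc @ w_orth                     # Y-orthogonal scores
    p_orth = Xc.T @ t_orth / (t_orth @ t_orth)

    X_filtered = Xc - np.outer(t_orth, p_orth)   # X with the orthogonal variation removed
    t_pred = X_filtered @ w                      # predictive scores after filtering
    return X_filtered, t_pred, t_orth

# Hypothetical toy data: variable 1 tracks the class, variables 2 and 3 mostly do not.
X = np.array([
    [1.0, 0.2, 3.0],
    [1.1, 0.9, 1.0],
    [0.9, 0.5, 2.0],
    [2.0, 0.3, 2.9],
    [2.1, 1.0, 1.1],
    [1.9, 0.6, 2.1],
])
y = np.array([0, 0, 0, 1, 1, 1])             # dummy variable assigning class belonging
X_filtered, t_pred, t_orth = opls_one_component(X, y)
```

By construction, the orthogonal scores t_orth are uncorrelated with the centred y, so the variation they describe can be inspected separately from the class-related variation carried by t_pred.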
For our investigation, we have chosen to use PCA and OPLS-DA as they are linear methods, and produce models which are more easily interpreted than those of non-linear methods such as ANNs [95, 96]. Also, as stated by Wiklund et al., “multivariate models such as PLS and OPLS include both statistical significance based on cross-validation and confidence intervals based on jack-knifing estimations as well as magnitude and reliability of the data provided by good visualization” [77]. These multivariate models are more powerful than univariate tests [77] as the latter do not show how groups of biomarkers are more powerful than the individual biomarkers themselves [97].
1.3.5 Pre-treatment of Data for Chemometric Analysis
The data that are used in such pattern recognition techniques must be properly processed prior to the use of these techniques, otherwise spurious correlations and patterns might be mistakenly identified. For example, in the case of HPLC-MS data, peak retention times between different runs must be aligned during pre-processing as retention times and baselines may deviate from run to run [46]. Also, data reduction in the form of peak-picking is necessary such that only the true analytical peaks remain, reducing the noise in the spectra and correlations to structures unrelated to class information [62].
Furthermore, it is imperative for the researcher to understand clearly the nature of the data obtained so that the appropriate peak alignment tools and scaling methods can be employed. Projection-based multivariate methods are sensitive to the scaling of the data [77, 92], which is in turn dependent on the data acquired for all samples. Scaling methods that could be used include mean-centring and auto-, pareto, and level scaling [17]. Weiss and Kim, in their review of metabolomics in kidney disease, also lend support to the view that metabolomic data need to be suitably scaled and transformed to attain symmetry and normality [2].
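The scaling methods named above differ only in the factor by which each mean-centred variable is divided. The following sketch uses the commonly cited definitions (standard deviation for autoscaling, its square root for pareto scaling, and the mean for level scaling); the function name scale_variable and the example values are hypothetical.

```python
import math
import statistics

def scale_variable(values, method="auto"):
    """Mean-centre one variable (column) and divide it by the chosen scaling factor.

    method: "centre" (mean-centring only), "auto" (standard deviation),
            "pareto" (square root of the standard deviation), "level" (mean).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # sample standard deviation
    divisor = {
        "centre": 1.0,
        "auto": sd,
        "pareto": math.sqrt(sd),
        "level": mean,
    }[method]
    return [(v - mean) / divisor for v in values]

peak_areas = [2.0, 4.0, 6.0]                 # hypothetical intensities of one metabolite
auto = scale_variable(peak_areas, "auto")
pareto = scale_variable(peak_areas, "pareto")
level = scale_variable(peak_areas, "level")
```

Autoscaling gives every variable unit variance, whereas pareto scaling reduces the dominance of large peaks less aggressively, which is one reason why the choice depends on the nature of the data.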
In metabolomics, normalization is carried out in order to reduce systematic errors in the data so that biologically significant changes in metabolite concentrations may be discovered [98, 99]. Normalization also helps to make the data obtained across samples comparable in size, for example by correcting for urine dilution effects [100]. There are generally six different methods of normalization for mass spectra: (1) with reference to mean, (2) with reference to median, (3) linear rescaling according to the largest and smallest values, (4) with reference to total ion count, (5) with reference to the peak of maximum intensity, and (6) with reference to an internal standard ([101, 102], cited in [103]).
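The six normalization methods listed above can be sketched for a single sample as follows. This is a simplified illustration that assumes the spectrum is a plain list of peak intensities; the function name normalise_spectrum, the method keywords, and the example values are hypothetical.

```python
import statistics

def normalise_spectrum(intensities, method, internal_standard_index=None):
    """Normalise one sample's peak intensities by the chosen reference.

    "minmax" rescales linearly so that the smallest value maps to 0 and the largest to 1;
    every other method divides each intensity by a single reference value.
    """
    if method == "minmax":
        low, high = min(intensities), max(intensities)
        return [(x - low) / (high - low) for x in intensities]
    if method == "mean":
        reference = statistics.mean(intensities)
    elif method == "median":
        reference = statistics.median(intensities)
    elif method == "total":                  # total ion count of the sample
        reference = sum(intensities)
    elif method == "max":                    # peak of maximum intensity
        reference = max(intensities)
    elif method == "internal_standard":
        reference = intensities[internal_standard_index]
    else:
        raise ValueError(f"unknown method: {method}")
    return [x / reference for x in intensities]

spectrum = [10.0, 20.0, 30.0, 40.0]          # hypothetical peak intensities for one sample
by_total = normalise_spectrum(spectrum, "total")
by_max = normalise_spectrum(spectrum, "max")
by_median = normalise_spectrum(spectrum, "median")
by_range = normalise_spectrum(spectrum, "minmax")
```

After normalization to the total, the intensities of each sample sum to one, which is what makes samples of different overall concentration comparable in size.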
For studies using urine as the biofluid, there are also the options of normalization to total urine volume, urine osmolality, range of peak intensities, or to an endogenous compound such as creatinine [17, 18, 99]. The choice of normalization method is important as, depending on the nature of the dataset, it also affects the identity and ranking of biomarkers found [103]. It is known that urine samples have a wide dynamic range in terms of the metabolite concentrations. Therefore, even though it has been found that there is usually a high level of consistency among methods in terms of the important compounds identified [103], there is still a need to choose the best method of normalization.
Normalization to total urine volume is usually not ideal as there are several limitations in using this as the reference. This method tends to introduce errors such as those due to inaccurate sample collection by controls and patients, whether for 24-hour collections (especially for children [104]) or for spot collections. It is also known that urine volume can be affected by hydration level ([105, 106], cited in [34]). Also, as the concentrations of metabolites change with the total urine volume, normalization to total urine volume may result in spurious correlations between metabolite concentrations and disease state. Therefore, urine volume is not usually the reference of choice as it varies too widely for meaningful comparisons to be made, whether in an intra- or inter-subject manner [99].
In addition, it is not recommended to use a single compound such as creatinine as the reference [13], since it may vary widely across individuals, especially in cases of kidney disease [18, 47]. Although urinary creatinine excretion for each person is relatively stable, it is not a good reference as many factors can affect its concentration in urine. Significant changes in urine creatinine within a so-called healthy population have been found in the study conducted by Saude et al. [47]. It has also been shown that creatinine fold change among the wider population can be highly varied [104]. Further, it has been found that urinary excretion of creatinine is affected especially in kidney disease due to its degradation in the body [18, 50]. These findings lend support to the decision not to use urinary creatinine as the internal reference for the normalization of metabolite concentrations.
As the correct choice of normalization can improve differentiation between study groups, it is felt that normalization to a form of total ion count is appropriate. This was observed in a study by Warrack et al. [99], which showed improved discrimination between dose groups. They recommend that urine samples be normalized to both the ‘mass spectrometry total useful signal’ (MSTUS) and osmolality. In their study, normalizing to total urine volume or to creatinine levels actually caused the group separation to become unclear, hence the recommendation to use osmolality and the MSTUS instead [99]. Therefore, the approach in the current work of normalizing to the 40 targeted metabolites and four m/z regions, instead of normalizing to creatinine or urine volume, finds its support here as well.
1.4 Chronic Kidney Disease
1.4.1 Overview of Chronic Kidney Disease
CKD is a “life-threatening condition characterized by progressive and irreversible loss of renal function” [107]. The term itself is non-generic, and represents the declining kidney function which arises from various diseases [108]. In terms of pathophysiology, kidney disease manifests itself in changes in the “glomerular filtration rate (GFR), glomerular permeability, tubular function, tubular damage, urinary reflux, obstruction to urinary flow, and deposition of collagen” [109]. According to Eknoyan et al., it is a disease which affects 5-10% of the world population [110]. Sabanayagam has also