Analyzing the human transcriptome is crucial in advancing precision medicine, and the plethora of over half a million human microarray samples in the Gene Expression Omnibus (GEO) has enabled us to better characterize biological processes at the molecular level.
RESEARCH ARTICLE Open Access
Data-driven human transcriptomic modules determined by independent component analysis
In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO in order to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies, and demonstrate (1) their usage as features for sample classification, and also their robustness in dealing with small training sets, (2) their regularization of data when clustering samples and (3) the biological relevancy of differentially expressed features.
Results: We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p < 0.01). The models also had higher accuracy and negative predictive value (p < 0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data is clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches.
Conclusions: The 139 FCs provide biologically-relevant summarization of transcriptomic data, and their performance in low sample settings suggests that they should be employed in such studies in order to harness the data efficiently.
Keywords: Independent component analysis, Gene expression, Functional modules, Transcriptome
Background
The human transcriptome, a snapshot of all mRNA molecules in a cell or tissue, is invaluable in advancing precision medicine. Many public databases have been established to map drug responses to transcriptomic profiles, such as the Wellcome Trust Sanger Institute’s Cancer Genome Project (CGP), the Connectivity Map (CMap) [1] and the Library of Network-based Cellular Signatures (LINCS). While the ability to measure gene expression levels of nearly every expressed gene in a cell allows for precise characterization of tissues at the molecular level, transcriptomic data is inherently noisy due to the dynamic nature of transcription. This makes it difficult to identify patient subtypes when the effect size is small, and also confounds direct interpretation of analysis results. Statistical methods to handle such high-dimension data typically control the false discovery rates through p-value corrections and q-value thresholding, or increase power via the simultaneous study of multiple genes (i.e. gene sets). Gene set enrichment analysis (GSEA) [2, 3] is widely used today in transcriptomic analysis, and is facilitated by the Molecular Signatures Database (MSigDB) [3], a database with 17,779 gene signatures across seven collections. In practice, researchers running GSEA typically choose a particular subset or collection from MSigDB based on what they believe to be related to the tissue or condition of the sample. This may induce bias in the analysis, as GSEA is sensitive to the choice of subsets used [4], and also to the amount of gene filtering done during preprocessing [5]. Furthermore, Liberzon et al. [6] found significant redundancies in MSigDB’s signatures, which can skew the reported enrichment scores from GSEA.
* Correspondence: russ.altman@stanford.edu
1 Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
2 Department of Genetics, Stanford University, Stanford, CA 94305, USA
© The Author(s) 2018. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
An alternative to using user-defined gene sets is to employ data-driven approaches to construct lower-dimension features so that the statistical power can be increased. Principal component analysis (PCA) has previously been performed on microarray data to summarize an experimental dataset containing tens of thousands of genes to a feature space that is a hundred-fold smaller [7, 8]. It was observed that while the first three or four principal components can sufficiently capture most biological signals in a large microarray dataset [9–11], they fail to do so when there is a small effect size and/or when there is a small number of samples exhibiting the effect [12]. Recent analysis performed by Tan et al. on Pseudomonas aeruginosa gene expression profiling experiments [13, 14] showed that components derived from PCA had fewer associated biological pathways than components from competing methods such as independent component analysis (ICA). More fundamentally, the underlying assumption of PCA (that the data is multivariate Gaussian) does not hold for transcriptomic data, which are typically super-Gaussian. Lee and Batzoglou [15] suggested the use of ICA, a related technique, as a more faithful model for such non-Gaussian data. The statistically independent components obtained from ICA have been reported to have biological significance [16–18], and are alternatively known as meta-genes, transcriptomic modules or functional components (FCs). Unlike gene sets, where a gene’s membership is binary, functional components present a smoothed and continuous version of set membership, better reflecting the complex network and co-dependency of genes. A sample transcriptome can then be expressed as a linear combination of these functional components:
$$g_i = \sum_{f=1}^{n} w_f F_{fi} + \epsilon_i$$
where $g_i$ is the expression level of gene i, $w_f$ is the coefficient of the corresponding functional component $F_f$, and $\epsilon_i$ is the noise in the measurement.
The extensive corpus of publicly available microarray data is useful for identifying functional components that are representative of fundamental human biology. These “data-derived” features have the advantage of not being dependent on expert prior knowledge, and can be used across different experimental conditions. In particular, analysis pipelines built on these features do not require user-defined parameters, thus increasing reproducibility of results. Engreitz et al. [19] previously demonstrated the use of ICA to identify such features based on a set of 9395 microarrays from GEO, but the methodology employed resulted in many correlated components, with the maximum correlation being 0.802. We have leveraged the exponential growth of GEO over the past decade to obtain a ten-fold increase in data for training our ICA model. Additionally, we have chosen to use the ICA algorithm ProDenICA [20], which has been previously documented to have higher sensitivity to a wider range of source distributions and better general performance than the more common FastICA algorithm [21]. Although the original authors of ProDenICA demonstrated its usage in relatively small datasets, the method has been extended to larger datasets recently, most notably by Risk et al. [22] in their application to large fMRI data. In this paper, we apply the algorithm to an even larger dataset based on the human transcriptome (20,089 genes), and identify a set of 139 functional components from a diverse range of human microarray data. Using six different studies from GEO, we demonstrate the usage of FCs in transcriptomic analysis. Specifically, we present the following:
(1) Rigorous quantification of the 139 FCs through multiple repeats and subsampling of data to ensure reproducibility of the components. We also constructed a tissue fingerprint library based on GSE3526 and GSE7307 so that query samples can be quickly mapped to the most similar tissue.
(2) Demonstration of the FCs as machine learning features for sample classification in two different studies from GEO (rheumatoid arthritis, GSE71370; leukemia, GSE13159). We show that the FCs can be used as classifier features without prior processing, as opposed to typical workflows that require the identification of differentially expressed (DE) gene sets before model training. We also evaluated their robustness in dealing with small training sets by subsampling the data from GSE13159 at different sizes. The performance of the models built using the FCs was then compared to the ones built using the original genes, and found to be superior when sample sizes were small. We note that this makes our methodology particularly useful for typical studies where the training set consists of fewer than 50 samples.
(3) Demonstration of the FCs’ ability to regularize data, using data from the MicroArray Quality Control (MAQC) study and a multi-center AML study, GSE15434. The FC-space clustering of MAQC samples is comparable to that of the original gene-space, and analysis of the AML study in FC space also produces more parsimonious results across the different centers.
(4) Evaluation of biological relevancy of differentially expressed FCs. We apply differential expression analysis to two different studies (rhabdomyosarcoma, GSE66533; dengue virus infection, E-MTAB-3162) and show that the significant FCs in both cases had biological annotations that were similar to the results from the original papers based on gene-level analysis.
Methods
Data collection
Raw data was collected as in [23]. Briefly, we obtained all human GEO series records (GSEs) that were found on the Affymetrix HG-U133 Plus 2.0 platform (GPL570) as of March 2015. After filtering for GSMs with associated raw CEL files, we obtained 2753 GSEs, containing 97,049 microarray CEL files. The CEL files were then processed using robust multi-array average (RMA) [24, 25] and corrected for technical bias [26]. The probes were then mapped to 20,089 unique Entrez gene identifiers using the R package Jetset v3.1.2 [27].
The dataset, containing 20,089 genes by 97,049 arrays, was then quantile-normalized between arrays, and gene-centered. This was followed by scaling and centering of the dataset by array. We denote the resulting matrix by F.
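The first two normalization steps above can be sketched as follows. This is a minimal illustration on plain Python lists, not the pipeline used in the paper (which was run in R); the function names are hypothetical, and ties within an array are broken by position, which is a simplifying assumption.

```python
from statistics import mean

def quantile_normalize(arrays):
    """Quantile-normalize arrays (one list of gene values per array).

    Each array is sorted, the sorted values are averaged rank by rank
    across arrays, and every value is replaced by the mean for its rank.
    Assumption: ties are broken by position rather than averaged.
    """
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(a[r] for a in sorted_arrays) for r in range(n_genes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda g: a[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized

def center_genes(arrays):
    """Subtract each gene's mean across arrays (gene-centering)."""
    n_genes = len(arrays[0])
    gene_means = [mean(a[g] for a in arrays) for g in range(n_genes)]
    return [[a[g] - gene_means[g] for g in range(n_genes)] for a in arrays]

# Toy input: two arrays measuring three genes.
toy = [[5, 2, 3], [4, 1, 6]]
normalized = quantile_normalize(toy)
centered = center_genes(normalized)
```

After quantile normalization every array shares the same set of values, so only the ordering of genes within an array carries information.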
Constructing a representative compendium
The Spearman’s rank correlation coefficient ($\rho_{i,j}$) was computed between all arrays. Distances between arrays $F_i$ and $F_j$ were defined as $1 - \rho_{i,j}$, and hierarchical clustering of the arrays was performed using average linkage. The maximum intra-cluster distance (cutoff height of the tree) was determined by using a k-nearest neighbor knee plot, and the tree was cut accordingly to obtain the corresponding clusters. We excluded clusters with fewer than five members and selected the medoids of the remaining clusters as representative samples. We refer to the collective set of medoids as the representative compendium, and denote it by X. To characterize the samples in the representative compendium, we extracted the corresponding metadata (title, source name, characteristics, description and treatment protocol) from GEO using GEOmetadb [28] and parsed them with BioPortal’s Annotator [29] to get the associated NLM Medical Subject Headings (MeSH) descriptors. We mapped the descriptors to their highest level term, and retained only the terms from the following four categories: A (anatomical terms), C (diseases), D (drugs and chemicals) and G (phenomena and processes).
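As a rough illustration of the distance and medoid definitions above, the sketch below computes $1 - \rho$ from average ranks and picks the cluster member with the smallest total distance to the other members. It assumes the cluster memberships are already known (the hierarchical clustering itself is not reproduced here), and all function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_distance(x, y):
    """Distance between two arrays, defined as 1 - Spearman correlation."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))

def medoid(cluster, arrays):
    """Index of the cluster member with the minimum total distance to the others."""
    return min(
        cluster,
        key=lambda i: sum(spearman_distance(arrays[i], arrays[j]) for j in cluster),
    )

# Toy example: three arrays over four genes, all in one cluster.
arrays = [[1, 2, 3, 4], [1, 2, 4, 3], [4, 3, 2, 1]]
representative = medoid([0, 1, 2], arrays)
```

For 97,049 arrays the full pairwise distance matrix is large, so a practical implementation would rely on vectorized routines rather than nested Python loops.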
To determine the relationship between the size of a derived compendium and time, we repeated the process across various calendar years in the GEO repository. Arrays in GEO at the end of each calendar year were processed similarly to the above to yield both the full compendium and the corresponding representative compendium for that year. The number of arrays in both compendia was then tabulated as a function of time.
Whitening and selection of number of components
Whitening (decorrelation of variables followed by scaling) of the data matrix [30] was done using singular value decomposition (SVD), $X = UDV^T$. The orthogonal matrix U is then used as input to the ICA algorithm. The diagonal values ($d_{ii}$) of the diagonal matrix D are related to the eigenvalues ($e_i$) of the covariance matrix ($X^TX$) by the transformation
$$e_i = \frac{d_{ii}^2}{g-1} = \frac{d_{ii}^2}{20088}$$
The correction of g − 1 is necessary for consistency with the unbiased estimate of variance.
The eigenvalues of the covariance matrix provide a way to select the number of components. In particular, parallel analysis is a well-documented method to stably perform the selection [31–34]. We performed 5000 simulations by running SVD on random matrices of the same dimension as the input matrix X. For each sequential component in the simulations, we obtained the median (Horn’s method [31]) and 95th percentile (Glorfeld’s method [32]) of the corresponding eigenvalues across the simulations, and used them as the bias. We then subtracted the bias from the actual eigenvalues of X, and retained the components (n) whose corrected eigenvalues were greater than 1. We denote the whitened and reduced data matrix by Y (g × n), and used Y as input for ICA.
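The bias-correction step of parallel analysis can be sketched as below. The sketch assumes the observed and simulated eigenvalues have already been computed (the SVD is not reproduced), uses a linear-interpolation percentile as one of several common conventions, and stops at the first component whose corrected eigenvalue is not above 1, which is an assumption about how "leading" components are counted. All names are hypothetical.

```python
from statistics import median

def eigenvalues_from_singular_values(singular_values, n_genes):
    """e_i = d_ii^2 / (g - 1), with g the number of genes (rows)."""
    return [d * d / (n_genes - 1) for d in singular_values]

def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def parallel_analysis(observed, simulated, method="horn"):
    """Count the leading components whose bias-corrected eigenvalue exceeds 1.

    observed:  eigenvalues of the real data, in decreasing order
    simulated: one eigenvalue list per random matrix, same ordering
    method:    "horn" (median bias) or "glorfeld" (95th percentile bias)
    """
    n_retained = 0
    for i, eigenvalue in enumerate(observed):
        null = [sim[i] for sim in simulated]
        bias = median(null) if method == "horn" else percentile(null, 95)
        if eigenvalue - bias > 1:
            n_retained += 1
        else:
            break
    return n_retained

# Toy numbers, not taken from the paper.
observed = [5.0, 3.0, 1.5, 1.2]
simulated = [[1.2, 1.1, 1.0, 0.9], [1.4, 1.2, 1.0, 0.8], [1.3, 1.0, 0.9, 0.7]]
n_horn = parallel_analysis(observed, simulated, method="horn")
n_glorfeld = parallel_analysis(observed, simulated, method="glorfeld")
```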
Independent component analysis
We implemented ICA using the R package ProDenICA [21]. The convergence threshold was set to 1e-6, with a maximum of 8000 iterations. Additionally, we set the number of grid points for density estimation to 2000, and the robustness parameter (“order”) to 11. ICA produces the following output:
$$Y = SA$$
where the source matrix S has dimensions g × n, and the mixing matrix A has dimensions n × n.
A total of 100 independent runs of ICA were performed on the input data Y, and the solutions were processed in a similar manner to Risk et al. [22]. First, all solutions were converted to their canonical form by ordering the ICs (columns of S) by their respective skewness. The solution with the highest negentropy score across the 100 repeats was chosen to be the “best solution”, and was then compared to the other 99 runs component-wise. For the source matrix from the k-th run, $S^k$, we computed the pairwise-component Pearson correlations with the “best solution” $S^0$, where ρ is the Pearson correlation matrix.
Minimization of the overall cost is a linear assignment problem, and was solved using the Hungarian algorithm (R package clue [35]). Let B be the matrix that represents this assignment, such that $B_{i,j} = 1$ if the i-th component of $S^0$ was assigned to the j-th component of $S^k$, and zero otherwise. The elements of the signed permutation matrix P are then defined as
$$P_{i,j} = \begin{cases} 1, & \mathbb{1}\left[r^{-}_{i,j} < r^{+}_{i,j}\right] \\ -1, & \mathbb{1}\left[r^{-}_{i,j} \geq r^{+}_{i,j}\right] \end{cases}$$
for $B_{i,j} \neq 0$, and zero otherwise. The permutation matrix is a 1–1 mapping and rearranges the columns of $S^k$ (with appropriate reorientation of direction) so that the correlations with the respective columns in $S^0$ are maximized. The component-wise correlations of the 99 solutions with the “best solution” are then
$$c^k_i = \mathrm{cor}\left(S^0_i, \left[S^kP\right]_i\right)$$
where $S^0_i$ and $[S^kP]_i$ are the i-th components (columns) of the “best solution” and the permuted source matrix from the k-th run respectively.
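The matching of components between two ICA runs can be illustrated with the sketch below. Two things are assumptions rather than the paper's method: the cost of pairing two components is taken to be 1 − |ρ|, and the sign is taken from the sign of ρ. The assignment is also found by brute force over permutations instead of the Hungarian algorithm, which is only feasible for a handful of components; the function name is hypothetical.

```python
from itertools import permutations

def match_components(rho):
    """Match components of run k to the best solution.

    rho[i][j] is the Pearson correlation between component i of the best
    solution and component j of run k. Returns (assignment, signs, aligned),
    where assignment[i] is the run-k component paired with component i,
    signs[i] is +1 or -1, and aligned[i] is the correlation after flipping.

    Assumptions: cost = 1 - |rho|, sign = sign of rho, and exhaustive
    search over permutations (a stand-in for the Hungarian algorithm).
    """
    n = len(rho)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(1 - abs(rho[i][p[i]]) for i in range(n)),
    )
    signs = [1 if rho[i][best[i]] >= 0 else -1 for i in range(n)]
    aligned = [signs[i] * rho[i][best[i]] for i in range(n)]
    return list(best), signs, aligned

# Toy correlation matrix for three components.
rho = [
    [0.10, -0.90, 0.20],
    [0.95, 0.00, 0.10],
    [0.20, 0.10, 0.80],
]
assignment, signs, aligned = match_components(rho)
```

With 139 components an exhaustive search is out of the question, which is why a polynomial-time assignment solver is needed in practice.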
Evaluation of component estimates
We resampled the full compendium randomly without replacement to obtain 50 similar-sized pseudo-representative compendia. Whitening was performed as described previously, but we selected the same number of components as the original solution to facilitate comparison between the models. For each of the 50 resampled compendia, we ran ICA ten times, and chose the solution with the highest negentropy score as the solution for that resampled compendium. We compared the 50 chosen solutions to the “best solution” from the representative compendium using the same methodology as in the previous section.
Biological annotations of components
GO terms and relationship to the H collection in MSigDB
For each component, we defined the sets of genes with loadings that were three standard deviations above or below the mean as the up or down modules respectively for the component. Collectively, we term the union of both sets of genes the active genes for the component. As per Engreitz et al. [19], we performed GO enrichment analysis using TopGO [36] on the up and down modules separately.
The percentage overlap between gene signatures from the H collection of MSigDB [3] and the active genes for each FC was calculated. For each gene signature-FC pair, we also checked for enrichment of overlapped genes by performing a hypergeometric test; only pairs that had a BH-corrected p-value of less than 0.01 were retained.
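A minimal sketch of the module definition and the overlap test described above. The use of the population standard deviation and the upper-tail form of the hypergeometric test are assumptions; the function names and toy numbers are hypothetical.

```python
from math import comb
from statistics import mean, pstdev

def up_down_modules(loadings, n_sd=3.0):
    """Split genes into up and down modules by loading.

    A gene is in the up (down) module when its loading lies more than
    n_sd standard deviations above (below) the mean loading.
    Assumption: the population standard deviation is used.
    """
    mu, sd = mean(loadings), pstdev(loadings)
    up = {g for g, w in enumerate(loadings) if w > mu + n_sd * sd}
    down = {g for g, w in enumerate(loadings) if w < mu - n_sd * sd}
    return up, down

def hypergeom_upper_tail(overlap, signature_size, n_active, n_genes):
    """P(X >= overlap) when n_active genes are drawn from n_genes,
    of which signature_size belong to the gene signature."""
    upper = min(signature_size, n_active)
    favourable = sum(
        comb(signature_size, x) * comb(n_genes - signature_size, n_active - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / comb(n_genes, n_active)

# Toy loadings: 50 genes near zero, one strongly positive, one strongly negative.
loadings = [0.0] * 50 + [10.0, -10.0]
up, down = up_down_modules(loadings)
active = up | down

p_full_overlap = hypergeom_upper_tail(3, 3, 3, 10)
```

The resulting p-values would still need a multiple-testing correction across all signature-FC pairs before applying the 0.01 threshold.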
Fingerprinting human tissues: GSE3526 and GSE7307
All 353 normal human samples from GSE3526, coming from 65 different tissue types derived from ten post-mortem donors, were downloaded from GEO, processed and projected into FC space. To obtain representative samples from the 22 nervous system tissues, we calculated the pairwise distances within each tissue type and selected the medoid (sample with the minimum distance to all other samples within the same tissue type). Clustering of the 22 samples was then performed. The set of all samples from GSE3526 was also used as a compendium to annotate queries with their most similar tissue origin.
GSE7307 (Human Body Index) contains 677 samples from 90 tissue types, some of which were from diseased patients. We downloaded only the healthy samples, and processed them as per GSE3526. We compared the tissue types that were common to both GSE3526 and GSE7307 using the Pearson correlation coefficient. To provide robust estimates that were not affected by outlying samples in the tissue types, we reported the median and the standard deviation of the correlations for each “GSE/tissue”-“GSE/tissue” pair. We also included the sub-compendium (“Human Tissue Compendium”), containing both 353 samples from GSE3526 and 504 samples from GSE7307, in our R package so that users can also use it to annotate their query samples with the most probable tissue types.
FC applications and analysis
For all evaluation datasets, the raw CEL files were downloaded from GEO. Each GSE was processed independently by running RMA on the set of samples, followed by technical bias correction as per the earlier section “Data collection”. Projection of a dataset $Q_{gene}$ into the corresponding FC space is done via the matrix multiplication
$$Q_{FC} = S^T Q_{gene}$$
A unitary vector space based on the FC loadings can also be defined by normalizing each component to have unit length, which we provide as an option in our R package. For all analyses in this paper, however, projection onto FC space was done using the original gene loadings in the calculated FCs.
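The projection $Q_{FC} = S^T Q_{gene}$ and the optional unit-length normalization can be sketched with plain lists as follows. This is an illustration only; the function name and the `unit_length` flag are hypothetical and do not reflect the interface of the authors' R package.

```python
def project_to_fc_space(S, Q_gene, unit_length=False):
    """Project gene-space data into FC space.

    S:       gene loadings, one row per gene and one column per FC
    Q_gene:  expression data, one row per gene and one column per sample
    Returns a matrix with one row per FC and one column per sample,
    i.e. S^T Q_gene. With unit_length=True each FC (column of S) is
    first scaled to Euclidean length 1.
    """
    n_genes, n_fcs = len(S), len(S[0])
    n_samples = len(Q_gene[0])
    norms = [1.0] * n_fcs
    if unit_length:
        norms = [sum(S[g][c] ** 2 for g in range(n_genes)) ** 0.5 for c in range(n_fcs)]
    return [
        [
            sum(S[g][c] * Q_gene[g][s] for g in range(n_genes)) / norms[c]
            for s in range(n_samples)
        ]
        for c in range(n_fcs)
    ]

# Toy example: three genes, two FCs, two samples.
S = [[1, 0], [0, 2], [1, 0]]
Q_gene = [[1, 2], [3, 4], [5, 6]]
Q_fc = project_to_fc_space(S, Q_gene)
Q_fc_unit = project_to_fc_space(S, Q_gene, unit_length=True)
```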
Wherever a t-test was used, Benjamini-Hochberg correction was performed on the p-values [37], with N being either the total number of FCs (139) or the total number of genes (20,089) being tested. For heatmaps, the genes and arrays were clustered using hierarchical clustering with average linkage, and the distance metric for both was defined using the Pearson correlation:
$$\mathrm{Dist}_{i,j} = 1 - \mathrm{cor}(Q_i, Q_j)$$
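For reference, the Benjamini-Hochberg adjustment mentioned above can be written in a few lines. This is the standard step-up procedure, shown as a sketch with a hypothetical function name and toy p-values.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input.

    Each p-value is scaled by N / rank (rank 1 = smallest), and the
    adjusted values are made monotone by taking a running minimum
    from the largest p-value downwards.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

# Toy p-values, e.g. from four t-tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because N is 139 in FC space and 20,089 in gene space, the same raw p-value is penalized far less when testing FCs.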
FCs as features for machine learning algorithms: GSE71370
and GSE13159
For GSE71370, the metadata available in GEO was used to annotate the samples under three categories: synovial fluid from rheumatoid arthritis (RA) patients (RASFM), peripheral blood from RA patients (RAPBM) and peripheral blood from healthy patients (HCPBM). Gene expression data were projected into FC space. For each of the three pair-wise comparisons between categories, unpaired t-tests were performed across the FCs, with a BH-corrected p-value threshold of 0.05. The union of the three sets of differentially expressed FCs was then used as the signature to cluster the sample types. We performed hierarchical clustering based on the FC values in the signature, using average linkage. To identify FCs that were specific to the differences between RAPBM and RASFM, we focused on the DE FCs that were unique to the pair (i.e. not in common with DE FCs from the RASFM vs RAPBM analysis), and report the corresponding GO enrichment annotations.
GSE13159 contains data from the Microarray Innovations in Leukemia (MILE) study program, consisting of eighteen different categories of leukemia (including a control group). The class labels of the individual samples were obtained from the Data Supplement accompanying the original publication [38]. After preprocessing as described earlier, the data was projected into FC space using the unitary vector space. The classification results were obtained using the same methodology as the original authors, by applying support vector machine (SVM) classifiers in three independent runs using 30-fold cross validation. The R package kernlab [39] was used to implement the classifiers with a linear kernel function. We also defined the call rate (CR) similarly as the number of determinable calls. The sensitivity for each class was calculated as the fraction of correctly predicted samples in that class out of all determinable calls in the run. We report the mean CR and sensitivity across the three runs.
Performance of FC-based models in low sample settings
Samples from classes C3 (c-ALL/pre-B-ALL with t(9;22), 122 samples) and C8 (c-ALL/pre-B-ALL without t(9;22), 237 samples) in GSE13159 were defined as the positive and negative groups respectively. For a given simulation run, we randomly chose 22 C3 and 37 C8 samples as the held-out test set. The remaining data in the two groups (100 C3 and 200 C8) were then subsampled at 5%, 10%, 20%, 40%, 60% and 80% to produce corresponding training sets for training SVM classifiers (same parameters as the above analysis for GSE13159). For each subsampling percentage, we repeated the sampling 200 times. For a particular sampling, we calculated the negative predictive value (NPV), positive predictive value (PPV, also known as precision), sensitivity (also known as recall), specificity and accuracy as follows:
$$P = [\text{Class} = C3], \quad N = [\text{Class} = C8]$$
$$\text{Pred.P} = [\text{Predicted Class} = C3], \quad \text{Pred.N} = [\text{Predicted Class} = C8]$$
$$TP_x = \#[P \cap \text{Pred.P}]_x, \quad TN_x = \#[N \cap \text{Pred.N}]_x$$
$$PPV_x = \frac{TP_x}{\#[\text{Pred.P}]_x}, \quad NPV_x = \frac{TN_x}{\#[\text{Pred.N}]_x}$$
$$\text{Sensitivity}_x = \frac{TP_x}{\#[P]}, \quad \text{Specificity}_x = \frac{TN_x}{\#[N]}, \quad \text{Accuracy}_x = \frac{TP_x + TN_x}{\#[P] + \#[N]}$$
The averages across the 200 samplings for a given subsampling percentage were then recorded as the respective statistics for that run. A total of ten independent simulation runs were performed, and the mean and standard deviation for the statistics were reported across the ten runs.
For comparison, we also calculated the above statistics using the full remaining data (i.e. subsampling percentage is 100%) at each run. Repeats were not performed for this case.
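The five statistics defined above follow directly from the true and predicted labels of the held-out test set. The sketch below assumes every test sample receives a C3 or C8 call and that both classes are predicted at least once (otherwise a division by zero occurs); the function name and toy labels are hypothetical.

```python
def classification_metrics(truth, predicted, positive="C3", negative="C8"):
    """Compute PPV, NPV, sensitivity, specificity and accuracy.

    truth and predicted are equal-length lists of class labels.
    Assumes each class occurs at least once in both lists.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t == negative and p == negative)
    n_pos, n_neg = truth.count(positive), truth.count(negative)
    n_pred_pos, n_pred_neg = predicted.count(positive), predicted.count(negative)
    return {
        "PPV": tp / n_pred_pos,
        "NPV": tn / n_pred_neg,
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
        "accuracy": (tp + tn) / (n_pos + n_neg),
    }

# Toy test set: three C3 and four C8 samples.
truth = ["C3", "C3", "C3", "C8", "C8", "C8", "C8"]
predicted = ["C3", "C3", "C8", "C8", "C8", "C8", "C3"]
metrics = classification_metrics(truth, predicted)
```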
Fig. 1 Sizes of full and representative compendia as a function of time. The number of arrays in both compendia is plotted here as a function of the year. In 2015, there were 97,049 arrays in the full compendium and 2726 in the representative compendium
Fig. 2 Variance explained by eigenvectors from SVD. (Left) Percentage of variance explained by each eigenvector. (Right) Cumulative variance explained by the eigenvectors. The blue lines represent the cumulative variance explained by the leading 139 eigenvectors
FCs retain biological information while regularizing data: MAQC and GSE15434
Affymetrix HG-U133 Plus 2.0 samples from the MicroArray Quality Control (MAQC) project were downloaded from GEO (GSE5350) and processed. The 120 samples were clustered in both FC space and full gene space, and the cophenetic correlation between the trees was computed. For visualization purposes, a tanglegram [40] using both trees was also generated. For evaluation of the FC-based clustering tree, we grouped samples from A and C as one mega-class, and B and D as the other mega-class. The tree was cut to yield two clusters, and these were then classified as one of the two mega-classes based on the majority of the cluster membership. The purity of the clustering was calculated as
$$\text{Purity} = \frac{1}{120} \sum_{i=1}^{2} \#\,\text{Correctly Classified Samples in } C_i$$
where $C_i$ is the i-th cluster. The Gini impurity for each of the two clusters was calculated as
$$\text{Gini Impurity}(C_i) = 1 - \sum_{j=1}^{2} f_j^2$$
where $f_j$ is the fraction of samples in the i-th cluster that are from the j-th mega-class.
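Both clustering scores defined above are simple functions of the class labels inside each cluster. A minimal sketch, with hypothetical function names and toy labels rather than the MAQC data:

```python
from collections import Counter

def cluster_purity(clusters):
    """Fraction of samples that belong to the majority class of their cluster.

    clusters is a list of clusters, each a list of true class labels.
    """
    total = sum(len(cluster) for cluster in clusters)
    correct = sum(Counter(cluster).most_common(1)[0][1] for cluster in clusters)
    return correct / total

def gini_impurity(cluster):
    """1 minus the sum of squared class fractions within one cluster."""
    n = len(cluster)
    return 1.0 - sum((count / n) ** 2 for count in Counter(cluster).values())

# Toy clustering: labels are the two mega-classes (A and C vs. B and D).
clusters = [["AC", "AC", "AC", "BD"], ["BD", "BD", "BD", "BD"]]
purity = cluster_purity(clusters)
impurities = [gini_impurity(cluster) for cluster in clusters]
```

A perfectly separated clustering gives a purity of 1 and a Gini impurity of 0 for every cluster.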
GSE15434 contains a total of 251 AML samples, coming from three different centers in Germany: Dresden (DRE), Munich (MUC) and Ulm (ULM), with 78, 96 and 77 samples respectively. Approximately half of the samples contained mutations in the NPM1 gene. We identified differentially expressed (DE) functional components (FCs) and genes between the NPM1-mutated and NPM1-wild type groups using the R package limma [41] at a false discovery rate threshold of 1%, and compared the number of shared DE FCs/genes between the three test centers. We also performed a typical gene set enrichment analysis [3] using the 4725 curated gene sets in the C2 collection of MSigDB v5.0, using the recommended parameters of 1000 phenotype permutations and a false discovery rate (FDR) of 25%. Significant gene sets were identified for both NPM1-mutated and NPM1-wild type groups.
Fig. 3 Correlation of FCs between the best solution and each of the 99 other runs. The yellow dots and the black lines are the mean and standard error of the mean Pearson correlation coefficient values respectively, for each FC. The empty blue circles are the maximum coefficient recorded for the FCs
Fig. 4 Promiscuity of significant genes in FCs. Significant genes for each FC were pooled together and tabulated. The histogram shows the distribution of how frequently a gene is found to be significant in one or more FCs. The x-axis is the number of FCs in which a particular gene is found to be significant, and the y-axis is the number of unique genes that meet the corresponding requirement. For instance, the maximum number of FCs that a gene was found to be significant in was 44, with only one gene achieving that criterion (AKR1C3; Entrez ID 8644). This is observed in the histogram at the 44th position on the x-axis, with a height corresponding to 1 (represented as a dot due to the scale). A total of 9091 genes were found to be significant in only one to three FCs
Differentially expressed FCs are biologically relevant: GSE66533 and E-MTAB-3162
For GSE66533, the rhabdomyosarcoma samples were separated into two main groups (33 PAX3-FOXO1 Fusion-Positive and 25 Fusion-Negative samples) based on descriptions obtained from Supplementary 1 of the paper by Sun et al. [42]. Gene expression data were projected into FC space, and unpaired t-tests were performed across the FCs to identify DE FCs. To perform a search for similar samples, we calculated the Pearson correlation coefficient in FC space between samples from the study and the full compendium. For each sample in either group, we retained all GSMs from the full compendium that had a correlation of more than 0.95, and term these “neighbors”. We then took the union of these “neighbors” within a group, and removed GSMs that were not considered “neighbors” to at least half of the group’s members. To identify GSMs that were unique to either group, we focused on the set-difference between the two sets of “neighbors”. We also applied our “Human Tissue Compendium” to identify the tissue types most closely associated with the samples.
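The neighbor search described above reduces to thresholding, counting and a set difference once the FC-space correlations are available. The sketch below assumes those correlations are supplied as nested dictionaries; the function names and the sample and GSM identifiers are hypothetical.

```python
def group_neighbors(correlations, threshold=0.95):
    """GSMs that are neighbors of at least half of a group's samples.

    correlations maps each study sample to a dict of
    {compendium GSM: Pearson correlation in FC space}.
    A GSM is a neighbor of a sample when the correlation exceeds threshold.
    """
    counts = {}
    for per_sample in correlations.values():
        for gsm, r in per_sample.items():
            if r > threshold:
                counts[gsm] = counts.get(gsm, 0) + 1
    return {gsm for gsm, count in counts.items() if 2 * count >= len(correlations)}

def unique_neighbors(neighbors_a, neighbors_b):
    """Neighbors found for one group but not the other (set differences)."""
    return neighbors_a - neighbors_b, neighbors_b - neighbors_a

# Toy correlations for a fusion-positive and a fusion-negative group.
positive = {
    "s1": {"GSM_A": 0.97, "GSM_B": 0.96, "GSM_C": 0.50},
    "s2": {"GSM_A": 0.99, "GSM_B": 0.96, "GSM_C": 0.96},
    "s3": {"GSM_A": 0.96, "GSM_B": 0.10, "GSM_C": 0.20},
}
negative = {
    "t1": {"GSM_A": 0.96, "GSM_D": 0.98},
    "t2": {"GSM_A": 0.97, "GSM_D": 0.97},
}
pos_neighbors = group_neighbors(positive)
neg_neighbors = group_neighbors(negative)
only_positive, only_negative = unique_neighbors(pos_neighbors, neg_neighbors)
```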
For E-MTAB-3162, the raw CEL files were downloaded from ArrayExpress [43] and processed. The metadata were obtained from the sdrf file and used to divide the dengue patient samples into two subgroups (Day 0 vs Day 4). We performed t-tests to identify the set of DE FCs. To map the GO annotations to GO slim terms, we used the Map2Slim tool [44] from the Gene Ontology project, with the go-basic ontology and the default goslim_generic subset.
Results
Representative compendium and parallel analysis
To avoid overrepresentation of any biological phenotype in the training data, clustering of microarray samples was performed on the full compendium (97,049 arrays) to obtain a representative compendium. The height cutoff of the clustering tree was determined to be 0.3 based on k-nearest neighbor plots for k = 4 and 5 (see Additional file 1: Figure S1). After the filtering step described in the methods section, we obtained a representative compendium consisting of 2726 samples. The clustering and filtering process was found to be robust against varying sizes of the full compendium, and scaled closely with the latter (Fig. 1). 86.4% of the samples have between two and nine unique MeSH terms coming from the four MeSH categories (Additional file 1: Table S1). The MeSH annotations of the representative compendium (Additional file 1: Table S1) indicate that about a third of the samples were cancer-related (MeSH term: C04, neoplasms), with a substantial number of representatives from other pathological conditions and diseases found in skin and immune system. There are also representatives from all major anatomical classes (MeSH terms: A0-A9).
After whitening of the representative compendium, parallel analysis suggested that only the leading 139 components should be retained. We note that the number of retained components was the same for both implementations of parallel analysis, using either the median (Horn’s method [31]) or the 95th percentile (Glorfeld’s method [32]) for determining bias. Collectively, the 139 components of the whitened data explained close to 80% of the total variance in the representative compendium (Fig. 2). This whitened and reduced matrix (20,089 × 139) was then used for subsequent ICA processing.
ICA and evaluation of component estimates
On average, each of the 100 independent runs took 2461 iterations to reach the convergence requirement. The final negentropy of the ICA solutions ranged from 0.2417 to 0.2443, with a median of 0.2442. Run 39’s solution yielded the highest negentropy and was thus used as the “best solution” in the rest of the paper. The 139 columns of the canonized S matrix are the independent components obtained from ICA, and we refer to them as functional components (FCs). Each FC had zero mean and unit standard deviation.
Fig. 5 Relationship between FCs and H collection from MSigDB. Overlaps between the active genes and gene signatures from the H collection were filtered for statistical significance and then presented as a percentage of the total number of genes in each signature
The derived FCs were well correlated between all 100 runs (Fig. 3), and for the majority of the FCs, the mean Pearson correlation coefficient was more than 0.8, with the maximum being close to 1 for all the FCs. Similar results were observed when Spearman correlation was used. We note that the leading 25 FCs of our chosen solution were also highly reproducible in the compendium subsampling analysis, with median Pearson correlation coefficients of more than 0.8. However, the correlation coefficients yielded by the subsampling analysis were uniformly lower than the ones observed in Fig. 3 across the FCs, and more so for the trailing FCs. In particular, FCs 65 to 139 had a maximum Pearson correlation coefficient of less than 0.8 in the subsampling analysis.
Biological interpretation of FCs
To gain a better understanding of the FCs, we identified the key gene contributors to each of them. The elements of each component are the gene loadings, and can be interpreted as the level of contribution of a gene to the component’s score. For a given FC, we consider the set of genes whose absolute loading is three standard deviations above the mean as active genes. Apart from FC 1, which only had 28 active genes, the number of active genes in the other FCs ranged from 103 to 494, with a median of 382. Amongst the 20,089 genes, 12,978 genes were found to be active in at least one FC. The majority of the genes were active in only up to three FCs (Fig. 4), and the maximum number of components that a gene was observed to be active in was 44.
The active genes for each FC were then used to obtain GO annotations for the corresponding FC. Of the 139 FCs, 22 did not have any GO annotations, and a further 14 had only one GO annotation. The largest number of GO annotations belonging to an FC was 58 (FCs 3 and 4). A total of 689 unique GO codes were obtained across the 139 FCs, a 66% increase compared to the 415 unique GO codes obtained from the corresponding 139 leading principal components. This suggests that there is more biological signal in the FCs than in components obtained via PCA, in line with current literature [13]. The GO annotations for some of the FCs are presented in this paper as part of the reanalysis of other gene expression studies; the complete set of GO annotations for the FCs can be found in our R package, humanFC.
The percentage gene overlap between the active genes in each FC and the respective gene signatures in the H collection of MSigDB was calculated, and only the statistically significant pairs are shown in Fig. 5. The highest overlap (75%) occurs between FC 2 and the H collection signature “INTERFERON_ALPHA_RESPONSE”, which contains 97 genes. Half of the signatures in the H collection contain 200 genes each, so even a pair with 50% gene overlap in Fig. 5 can indicate up to 100 shared genes. For instance, FC 10 and the gene signature “HEME_METABOLISM” have only a 52.5% overlap, but the actual number of shared genes is 105. In particular, FC 10 has five GO annotations (GO:0006782, GO:0051597, GO:0015701, GO:0006879 and GO:0048821) that are all related to heme metabolism, supporting a strong relationship with the namesake gene signature.
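The overlap percentages above are expressed relative to the signature size, which a short sketch makes concrete. The significance filter is not specified in this section, so the one-sided hypergeometric test below is an assumption chosen for illustration, as are the function names and the toy gene sets.

```python
from math import comb


def overlap_percent(active, signature):
    """Shared genes as a percentage of the signature size."""
    return 100.0 * len(active & signature) / len(signature)


def hypergeom_pvalue(n_universe, n_signature, n_active, n_shared):
    """P(X >= n_shared) when n_active genes are drawn without replacement
    from n_universe genes, of which n_signature belong to the signature."""
    upper = min(n_signature, n_active)
    total = comb(n_universe, n_active)
    tail = sum(
        comb(n_signature, k) * comb(n_universe - n_signature, n_active - k)
        for k in range(n_shared, upper + 1)
    )
    return tail / total


# Toy example (assumption): a universe of 20 genes, a 10-gene signature
# and 5 active genes, 4 of which fall inside the signature.
signature = {f"g{i}" for i in range(10)}
active = {"g0", "g1", "g2", "g3", "g15"}

percent = overlap_percent(active, signature)
p_value = hypergeom_pvalue(20, len(signature), len(active), len(active & signature))

# Arithmetic from the text: a 52.5% overlap with a 200-gene signature.
shared_heme = round(0.525 * 200)
```

Because the percentage is taken over the signature rather than over the active genes, a moderate percentage on a 200-gene signature can still correspond to a large absolute number of shared genes.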
Twenty of our FCs do not have any significant gene overlap with the signatures in the H collection. Of these, fourteen of them (FC 1, 36, 51, 59, 60, 90, 94, 99, 102, 103, 112, 118, 123, 128 and 137) also do not have any GO annotations. The lack of GO annotations for these FCs does not necessarily indicate a lack of biological significance; for instance, the active genes in FC 1 are clearly markers for sex-specific features (Table 1). Insight into the characteristics of these FCs can also be obtained by looking at the tissue samples or microarray experiments that have the highest or lowest scores in those FCs. In the case of FC 36, the ten lowest-scoring samples were mostly from myeloma cells, whereas the highest-scoring samples were from normal epithelia.

Table 1 Active genes for FC 1
Fingerprinting human tissues
We built a database of tissue fingerprints so that it could be used to annotate future samples. To avoid fitting to errors from a single study, we compared the fingerprints from two relevant tissue studies (GSE3526 and GSE7307) with each other.
About a third of the samples from GSE3526 were from 22 tissue types belonging to the nervous system, and we performed clustering of the representatives from these tissues (Fig. 6). The clustering displayed underlying anatomical and physiological similarities between the tissues. For instance, the tissues from the three lobes (parietal, occipital and temporal) were grouped together with the cerebral cortex in one major cluster, whereas the other cluster was enriched for tissues from the peripheral nervous system, such as ganglia tissues (trigeminal, dorsal root) and the spinal cord, and most members of the basal ganglia (substantia nigra, subthalamic nucleus, ventral tegmental area).
A total of 65 tissue types were common to both GSE3526 and GSE7307 based on the annotations in GEO (the tissue types in the former are a proper subset of those in the latter). The median Pearson correlation coefficients (MPC) between tissues from GSE3526 and GSE7307 are shown in Fig. 7. Tissues from the same classes (diagonal of Fig. 7) were highly correlated, with an average MPC of 0.985 and an interquartile range of 0.981 to 0.990. The mean standard deviation across the whole MPC matrix was 0.0269, with an interquartile range of 0.0157 to 0.0346.
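The MPC between two tissue groups can be sketched as the median over all cross-study sample pairs. The example below is a minimal illustration with made-up FC-space scores; the tissue names, the values and the function names are assumptions, not data from GSE3526 or GSE7307.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def median_pairwise_correlation(group_a, group_b):
    """Median Pearson correlation over every (sample in A, sample in B) pair."""
    return statistics.median(pearson(a, b) for a in group_a for b in group_b)


# Toy FC-space scores (assumption): two samples per tissue and study, four FCs each.
cortex_study1 = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]]
cortex_study2 = [[0.9, 2.2, 3.1, 3.9], [1.0, 1.9, 3.0, 4.1]]
liver_study2 = [[4.0, 3.0, 2.0, 1.0], [4.1, 2.8, 2.2, 0.9]]

same_tissue = median_pairwise_correlation(cortex_study1, cortex_study2)
different_tissue = median_pairwise_correlation(cortex_study1, liver_study2)
```

Filling a matrix with this quantity for every tissue pair gives a structure like the one in Fig. 7, where matching tissues sit on the diagonal.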
FCs as features for machine learning algorithms
To demonstrate the applicability of our FCs as features for use in machine learning algorithms, we applied our FCs to two different studies (rheumatoid arthritis and leukemia). Additionally, we performed subsampling of the leukemia study to compare how model performance in FC space and full gene space is affected in low-sample settings.
GSE71370 (rheumatoid arthritis)
GSE71370 contains three sample types: peripheral blood from rheumatoid arthritis (RA) patients (RAPBM), peripheral blood from healthy patients (HCPBM), and synovial fluid from RA patients (RASFM). Using the standard Affymetrix chip definition file (CDF), we found 6636 DE genes between RASFM and HCPBM, and zero DE genes between RAPBM and HCPBM.
Fig. 6 Representative tissue samples from the nervous system (GSE3526). Medoid samples from each tissue type were selected as the representatives for that tissue. The representatives were then clustered using hierarchical clustering with average linkage. The full range of samples can be found in Additional file 1: Figure S3 (no clustering performed, but the order of the tissue types is the same as here). The order of the FCs in the plot can be found in Additional file 1: Text S1.
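Medoid selection, as used for the representatives in Fig. 6, picks the actual sample that is closest on average to all other samples of the same tissue. The sketch below assumes Euclidean distance in FC space, which is not stated in this section, and uses hypothetical function names and toy coordinates; the subsequent average-linkage clustering is not shown.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def medoid_index(samples):
    """Index of the sample with the smallest total distance to all samples."""
    totals = [sum(euclidean(s, t) for t in samples) for s in samples]
    return min(range(len(samples)), key=totals.__getitem__)


# Toy tissue group (assumption): three samples described by two FC scores.
tissue_samples = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
representative = medoid_index(tissue_samples)
```

Unlike a centroid, the medoid is always one of the observed samples, so the representative remains a real microarray profile.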
There are zero differentially expressed (DE) FCs between RAPBM and HCPBM, 72 DE FCs between RAPBM and RASFM, and 89 DE FCs between RASFM and HCPBM. Sixty-one FCs were common to the latter two sets, resulting in a combined signature of 100 FCs for clustering. Figure 8 shows the clustering results using the signature. A distinct separation between the classes is observed, and the two subgroups from the same tissue type (peripheral blood) are clustered together. There are eleven DE FCs that are unique to the comparison between RASFM and RAPBM. Additional file 1: Figure S4 shows the corresponding clustering results in gene space, and Additional file 1: Table S3 lists the FCs and the corresponding GO annotations. In total, there were 75 unique GO terms that were associated with the selected FCs.
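The FC counts reported for GSE71370 are consistent with simple set arithmetic. Writing A for the 72 DE FCs between RAPBM and RASFM and B for the 89 DE FCs between RASFM and HCPBM (the symbols A and B are introduced here only for illustration):

```latex
\begin{aligned}
|A \cup B| &= |A| + |B| - |A \cap B| = 72 + 89 - 61 = 100,\\
|A \setminus B| &= |A| - |A \cap B| = 72 - 61 = 11.
\end{aligned}
```

The first line recovers the combined signature of 100 FCs, and the second recovers the eleven FCs unique to the comparison between RASFM and RAPBM.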
GSE13159 (leukemia)
GSE13159 contains patient samples from 18 different classes of leukemia. Table 2 shows the confusion matrix of the SVM classification model using data that was projected into our FC space, and Fig. 9 summarizes the average differences between our confusion matrix and that from the original paper (Table 2 in Haferlach et al. [38]) after normalizing for class size. The call rates (CR) achieved by both models are very similar, although the class-wise sensitivity of the model from Haferlach et al. was generally slightly better, averaging 0.0692 higher than that of our model. For half of the 18 classes, the differences between the sensitivities from the two models were insignificant (the median difference is 0.0575), and for class C15, our FC-based model marginally outperformed Haferlach’s model. The misclassification patterns (off-diagonals of the confusion matrices) were similar between both models, although our FC-based model misclassified samples as C8 or C13 more frequently.

Fig. 7 Tissue sample correlations between GSE3526 and GSE7307. Pairwise Pearson correlation coefficients were calculated (in FC space) between samples from GSE3526 and GSE7307. The figure shows the median correlation scores for each GSE/tissue-GSE/tissue pair. High correlation is observed between anatomically-related tissues.
The random forest we built indicated that FCs 18, 39 and 54 are the three most important variables (Fig. 10). The corresponding GO annotations for the three FCs (Table 3) are all related to immune response.
Performance of FC-based models in low sample settings
We subsampled two classes from the leukemia study at various fractions to create datasets of varying sizes. The FC space models had higher negative predictive value (NPV), sensitivity and accuracy than the full gene space models when the fraction of training data used was low (Fig. 11b, c and e). Specifically, we observed that the FC-based models had higher sensitivity for subsampling fractions of up to 20% of the full training size (300), and higher accuracy and NPV (p < 0.01) for subsampling fractions of up to 10%.
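The three metrics compared here follow directly from the counts of a two-class confusion matrix, and the training subsets can be drawn by random sampling. The sketch below is illustrative only: the counts, the fixed seed, and the function names `binary_metrics` and `subsample` are assumptions and do not reproduce the sampling scheme or results of this study.

```python
import random


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, negative predictive value and accuracy from
    true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def subsample(items, fraction, seed=0):
    """Draw a random subset holding the given fraction of the items."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(items)))
    return rng.sample(items, k)


# Hypothetical counts for one trained model.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)

# A 10% subsample of a training set of 300 samples (indices stand in for samples).
training_indices = list(range(300))
small_training_set = subsample(training_indices, 0.10)
```

Training a model on each subsample and recomputing the metrics on a held-out set would trace out curves of performance against training fraction, as in Fig. 11.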
Fig. 8 Clustering of GSE71370 samples using 100 FCs. One hundred FCs were identified as DE between the pairwise classes and were used to perform clustering on the samples. The three sample classes were separated very well by hierarchical clustering (average linkage), with only GSM1833142 appearing to be clustered incorrectly as RASFM. The order of the FCs in the heatmap can be found in Additional file 1: Text S1.