Data-driven human transcriptomic modules determined by independent component analysis

Analyzing the human transcriptome is crucial in advancing precision medicine, and the plethora of over half a million human microarray samples in the Gene Expression Omnibus (GEO) has enabled us to better characterize biological processes at the molecular level.


RESEARCH ARTICLE Open Access


In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO in order to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies, and demonstrate (1) their usage as features for sample classification, and also their robustness in dealing with small training sets, (2) their regularization of data when clustering samples and (3) the biological relevancy of differentially expressed features.

Results: We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p < 0.01). The models also had higher accuracy and negative predictive value (p < 0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data are clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches.

Conclusions: The 139 FCs provide biologically relevant summarization of transcriptomic data, and their performance in low-sample settings suggests that they should be employed in such studies in order to harness the data efficiently.

Keywords: Independent component analysis, Gene expression, Functional modules, Transcriptome

Background

The human transcriptome, a snapshot of all mRNA molecules in a cell or tissue, is invaluable in advancing precision medicine. Many public databases have been established to map drug responses to transcriptomic profiles, such as the Wellcome Trust Sanger Institute’s Cancer Genome Project (CGP), the Connectivity Map (CMap) [1] and the Library of Network-based Cellular Signatures (LINCS). While the ability to measure gene expression levels of nearly every expressed gene in a cell allows for precise characterization of tissues at the molecular level, transcriptomic data is inherently noisy due to the dynamic nature of transcription. This makes it difficult to identify patient subtypes when the effect size is small, and also confounds direct interpretation of analysis results. Statistical methods to handle such high-dimension data typically control the false discovery rates through p-value corrections and q-value thresholding, or increase power via the simultaneous study of multiple genes (i.e. gene sets). Gene set enrichment analysis (GSEA) [2, 3] is widely used today in transcriptomic analysis, and is facilitated by the Molecular Signatures Database (MSigDB) [3], a database with

* Correspondence: russ.altman@stanford.edu

1 Department of Bioengineering, Stanford University, Stanford, CA 94305, USA

2 Department of Genetics, Stanford University, Stanford, CA 94305, USA

© The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made The Creative Commons Public Domain Dedication waiver


17,779 gene signatures across seven collections. In practice, researchers running GSEA typically choose a particular subset or collection from MSigDB based on what they believe to be related to the tissue or condition of the sample. This may introduce bias into the analysis, as GSEA is sensitive to the choice of subsets used [4], and also to the amount of gene filtering done during preprocessing [5]. Furthermore, Liberzon et al. [6] also found significant redundancies in MSigDB’s signatures, which can skew the reported enrichment scores from GSEA.

An alternative to using user-defined gene sets is to employ data-driven approaches to construct lower-dimension features so that the statistical power can be increased. Principal component analysis has previously been performed on microarray data to summarize the experimental dataset containing tens of thousands of genes to a feature space that is a hundred-fold smaller [7, 8]. It was observed that while the first three or four principal components can sufficiently capture most biological signals in a large microarray dataset [9–11], they fail to do so when there is a small effect size and/or when there is a small number of samples exhibiting the effect [12]. Recent analysis performed by Tan et al. on Pseudomonas aeruginosa gene expression profiling experiments [13, 14] showed that components derived from PCA had fewer associated biological pathways than components from competing methods such as ICA. More fundamentally, the underlying assumption of PCA (that the data is a multivariate Gaussian) does not hold for transcriptomic data, which are typically super-Gaussian. Lee and Batzoglou [15] suggested the use of a related technique, independent component analysis (ICA), as a more faithful model for such non-Gaussian data. The statistically independent components obtained from ICA have been reported to have biological significance [16–18], and are alternatively known as meta-genes, transcriptomic modules or functional components (FCs). Unlike gene sets, where a gene’s membership is binary, functional components present a smoothed and continuous version of set membership, better reflecting the complex network and co-dependency of genes. A sample transcriptome can then be expressed as a linear combination of these functional components:

$$g_i = \sum_{f=1}^{n} w_f F_{fi} + \epsilon_i$$

where $g_i$ is the expression level of gene i, $w_f$ is the coefficient of the corresponding functional component $F_f$, and $\epsilon_i$ is the noise in the measurement.

The extensive corpus of publicly available microarray data is useful for identifying functional components that are representative of fundamental human biology. These “data-derived” features have the advantage of not being dependent on expert prior knowledge, and can be used across different experimental conditions. In particular, analysis pipelines built on these features do not require user-defined parameters, thus increasing reproducibility of results. Engreitz et al. [19] previously demonstrated the use of ICA to identify such features based on a set of 9395 microarrays from GEO, but the methodology employed resulted in many correlated components, with the maximum correlation being 0.802. We have leveraged the exponential growth of GEO over the past decade to obtain a ten-fold increase in data for training our ICA model. Additionally, we have chosen to use the ICA algorithm ProDenICA [20], which has been previously documented to have higher sensitivity to a wider range of source distributions and better general performance than the more common FastICA algorithm [21]. Although the original authors of ProDenICA demonstrated its usage in relatively small datasets, the method has been extended to larger datasets recently, most notably by Risk et al. [22] in their application to large fMRI data. In this paper, we apply the algorithm to an even larger dataset based on the human transcriptome (20,089 genes), and identify a set of 139 functional components from a diverse range of human microarray data. Using six different studies from GEO, we demonstrate the usage of FCs in transcriptomic analysis. Specifically, we present the following:

(1) Rigorous quantification of the 139 FCs through multiple repeats and subsampling of data to ensure reproducibility of the components. We also constructed a tissue fingerprint library based on GSE3526 and GSE7307 so that query samples can be quickly mapped to the most similar tissue.

(2) Demonstration of the FCs as machine learning features for sample classification in two different studies from GEO (rheumatoid arthritis, GSE71370; leukemia, GSE13159). We show that the FCs can be used as classifier features without prior processing, as opposed to typical workflows that require the identification of DE gene sets before model training. We also evaluated their robustness in dealing with small training sets by subsampling the data from GSE13159 at different sizes. The performance of the models built using the FCs was then compared to that of the models built using the original genes, and found to be superior when sample sizes were small. We note that this makes our methodology particularly useful for typical studies where the training set consists of fewer than 50 samples.

(3) Demonstration of the FCs’ ability to regularize data, using data from the MicroArray Quality Control (MAQC) study and a multi-center AML study, GSE15434. The FC-space clustering of MAQC samples is comparable to that of the original gene-space, and analysis of the AML study in FC space also produces more parsimonious results across the different centers.


(4) Evaluation of biological relevancy of differentially expressed FCs. We apply differential expression analysis to two different studies (rhabdomyosarcoma, GSE66533; dengue virus infection, E-MTAB-3162) and show that the significant FCs in both cases had biological annotations that were similar to the results from the original papers based on gene-level analysis.

Methods

Data collection

Raw data was collected as in [23]. Briefly, we obtained all human GEO series records (GSEs) that were found on the Affymetrix HG-U133 Plus 2.0 platform (GPL570) as of March 2015. After filtering for GSMs with associated raw CEL files, we obtained 2753 GSEs, containing 97,049 microarray CEL files. The CEL files were then processed using robust multi-array average (RMA) [24, 25] and corrected for technical bias [26]. The probes were then mapped to 20,089 unique Entrez gene identifiers using the R package Jetset v3.1.2 [27].

The dataset, containing 20,089 genes by 97,049 arrays, was then quantile-normalized between arrays, and gene-centered. This was followed by scaling and centering of the dataset by array. We denote the resulting matrix by F.
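The normalization steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `quantile_normalize` and `preprocess` are hypothetical, the toy matrix stands in for the RMA-processed data, ties are not averaged, and the order of centering and scaling follows our reading of the text rather than the authors' actual R code.

```python
import numpy as np

def quantile_normalize(mat):
    """Force every array (column) onto a common distribution (ties are not averaged)."""
    ranks = np.argsort(np.argsort(mat, axis=0), axis=0)
    reference = np.sort(mat, axis=0).mean(axis=1)  # mean of the k-th smallest values
    return reference[ranks]

def preprocess(expr):
    """expr: genes x arrays matrix of expression values (hypothetical input)."""
    qn = quantile_normalize(expr)
    centered = qn - qn.mean(axis=1, keepdims=True)              # gene-centering
    centered = centered - centered.mean(axis=0, keepdims=True)  # centering by array
    return centered / centered.std(axis=0, ddof=1, keepdims=True)  # scaling by array

rng = np.random.default_rng(0)
toy = rng.lognormal(size=(200, 6))  # toy data: 200 genes x 6 arrays
F = preprocess(toy)
```

After this step every column of the toy `F` has zero mean and unit standard deviation, which is the property the later whitening step relies on.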

Constructing a representative compendium

Spearman’s rank correlation coefficient ($\rho_{i,j}$) was computed between all arrays. Distances between arrays $F_i$ and $F_j$ were defined as $1 - \rho_{i,j}$, and hierarchical clustering of the arrays was performed using average linkage. The maximum intra-cluster distance (cutoff height of tree) was determined by using a k-nearest neighbor knee plot, and the tree was cut accordingly to obtain the corresponding clusters. We excluded clusters with fewer than five members and selected the medoids of the remaining clusters as representative samples. We refer to the collective set of medoids as the representative compendium, and denote it by X. To characterize the samples in the representative compendium, we extracted the corresponding metadata (title, source name, characteristics, description and treatment protocol) from GEO using GEOmetadb [28] and then parsed them with BioPortal’s Annotator [29] to get the associated NLM Medical Subject Headings (MeSH) descriptors. We mapped the descriptors to their highest level term, and retained only the terms from the following four categories: A (anatomical terms), C (diseases), D (drugs and chemicals) and G (phenomena and processes).
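The clustering and medoid selection described in this section could look roughly like the sketch below, which uses SciPy in place of the authors' R workflow. The function name `representative_medoids`, the toy data and the cut height are assumptions for illustration; the knee-plot step that chooses the cut height is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def representative_medoids(F, cut_height, min_size=5):
    """F: genes x arrays. Returns column indices of the cluster medoids."""
    rho = spearmanr(F).correlation          # arrays are treated as variables (columns)
    dist = 1.0 - rho
    dist = (dist + dist.T) / 2.0            # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=cut_height, criterion="distance")
    medoids = []
    for lab in np.unique(labels):
        members = np.flatnonzero(labels == lab)
        if len(members) < min_size:         # drop clusters with fewer than five members
            continue
        within = dist[np.ix_(members, members)]
        medoids.append(int(members[within.sum(axis=1).argmin()]))
    return sorted(medoids)

# Toy data: three groups of similar arrays with 6, 6 and 3 members.
rng = np.random.default_rng(1)
bases = rng.normal(size=(300, 3))
sizes = [6, 6, 3]
F = np.column_stack(
    [bases[:, [k]] + 0.1 * rng.normal(size=(300, n)) for k, n in enumerate(sizes)]
)
medoids = representative_medoids(F, cut_height=0.3)
```

In this toy example the third group has only three members, so it is filtered out and two medoids remain, one per retained cluster.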

To determine the relationship between the size of a derived compendium and time, we repeated the process across various calendar years in the GEO repository. Arrays in GEO at the end of each calendar year were processed similarly to the above to yield both the full compendium and the corresponding representative compendium for that year. The number of arrays in both compendia was then tabulated as a function of time.

Whitening and selection of number of components

Whitening (decorrelation of variables followed by scaling) of the data matrix [30] was done using singular value decomposition (SVD), $X = UDV^T$. The orthogonal matrix U is then input to the ICA algorithm. The diagonal values ($d_{ii}$) of the diagonal matrix D are related to the eigenvalues ($e_i$) of the covariance matrix ($X^TX$) by the transformation

$$e_i = \frac{d_{ii}^2}{g-1} = \frac{d_{ii}^2}{20088}$$

The correction of $g - 1$ is necessary for consistency with the unbiased estimate of variance.

The eigenvalues of the covariance matrix provide a way to select the number of components. In particular, parallel analysis is a well-documented method to stably perform the selection [31–34]. We performed 5000 simulations by running SVD on random matrices of the same dimension as the input matrix X. For each sequential component in the simulations, we obtained the median (Horn’s method [31]) and 95th percentile (Glorfeld’s method [32]) of the corresponding eigenvalues across the simulations, and used them as the bias. We then subtracted the bias from the actual eigenvalues of X, and retained the components (n) whose corrected eigenvalues were greater than 1. The whitened and reduced data matrix Y (g × n) was then used as input for ICA.
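As a rough illustration of the parallel analysis described above, the sketch below follows the text literally: the bias is a quantile of the simulated eigenvalues, and components are kept when the observed eigenvalue minus the bias exceeds 1. The function name `parallel_analysis`, the Gaussian random matrices, the reduced number of simulations and the toy data are all assumptions, not the authors' implementation.

```python
import numpy as np

def parallel_analysis(X, n_sim=100, quantile=0.5, seed=0):
    """X: g x a matrix with standardized columns.
    quantile=0.5 mimics the median rule, quantile=0.95 the 95th-percentile rule."""
    g, a = X.shape
    rng = np.random.default_rng(seed)
    eig = np.linalg.svd(X, compute_uv=False) ** 2 / (g - 1)
    sim = np.empty((n_sim, eig.size))
    for s in range(n_sim):
        noise = rng.standard_normal((g, a))  # random matrix of the same dimension
        sim[s] = np.linalg.svd(noise, compute_uv=False) ** 2 / (g - 1)
    bias = np.quantile(sim, quantile, axis=0)
    corrected = eig - bias
    return int(np.sum(corrected > 1.0)), corrected

# Toy data: three latent factors, each driving six columns, plus two noise columns.
rng = np.random.default_rng(0)
g = 500
factors = rng.standard_normal((g, 3))
blocks = [factors[:, [k]] + 0.5 * rng.standard_normal((g, 6)) for k in range(3)]
X = np.hstack(blocks + [rng.standard_normal((g, 2))])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n_keep, corrected = parallel_analysis(X)
```

With three planted factors the toy run retains three components under either quantile setting, mirroring the agreement between the two rules reported in the Results.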

Independent component analysis

We implemented ICA using the R package ProDenICA [21]. The convergence threshold was set to 1e-6, with a maximum of 8000 iterations. Additionally, we set the number of grid points for density estimation to 2000, and the robustness parameter (“order”) to 11. ICA produces the following output:

$$Y = SA$$

where the source matrix S has dimensions g × n, and the mixing matrix A has dimensions n × n.

A total of 100 independent runs of ICA were performed on the input data Y, and the solutions were processed in a similar manner to Risk et al. [22]. First, all solutions were converted to their canonical form by ordering the ICs (columns of S) by their respective skewness. The solution


with the highest negentropy score across the 100 repeats was chosen to be the “best solution”, and was then compared to the other 99 runs component-wise. For the source matrix from the k-th run, $S^k$, we computed the pairwise-component Pearson correlations with the “best solution”, $S^0$, where ρ is the Pearson correlation matrix.

Minimization of the overall cost is a linear assignment problem, and was solved using the Hungarian algorithm (R package clue [35]). Let B be the matrix that represents this assignment, such that $B_{i,j} = 1$ if the i-th component of $S^0$ was assigned to the j-th component of $S^k$, and zero otherwise. The elements of the signed permutation matrix P are then defined as

$$P_{i,j} = \begin{cases} 1, & r^{-}_{i,j} < r^{+}_{i,j} \\ -1, & r^{-}_{i,j} \ge r^{+}_{i,j} \end{cases}$$

for $B_{i,j} \neq 0$, and zero otherwise. The permutation matrix is a 1–1 mapping and rearranges the columns of $S^k$ (with appropriate reorientation of direction) so that the correlations with the respective columns in $S^0$ are maximized. The component-wise correlations of the 99 solutions with the “best solution” are then

$$c_i^k = \mathrm{cor}\left(S^0_i, [S^k P]_i\right)$$

where $S^0_i$ and $[S^k P]_i$ are the i-th components (columns) of the “best solution” and the permuted source matrix from the k-th run respectively.
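The matching of components between two ICA runs can be illustrated with SciPy's linear assignment solver. This is a minimal sketch, not the authors' code (they used the R package clue): the cost $1 - |\rho|$ is an assumption made here because the exact cost definition is not recoverable from the text, and P is laid out so that right-multiplying $S^k$ reorders and reorients its columns.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(S0, Sk):
    """Align the columns of Sk to those of S0 with a signed permutation matrix P."""
    n = S0.shape[1]
    rho = np.corrcoef(S0.T, Sk.T)[:n, n:]     # rho[i, j] = cor(S0_i, Sk_j)
    cost = 1.0 - np.abs(rho)                  # assumed cost: sign-agnostic dissimilarity
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style optimal assignment
    P = np.zeros((n, n))
    P[cols, rows] = np.sign(rho[rows, cols])  # flip components that are anti-correlated
    aligned = Sk @ P
    c = np.array([np.corrcoef(S0[:, i], aligned[:, i])[0, 1] for i in range(n)])
    return P, c

# Toy check: Sk is a shuffled, sign-flipped, slightly noisy copy of S0.
rng = np.random.default_rng(0)
S0 = rng.laplace(size=(1000, 4))
perm, signs = [2, 0, 3, 1], np.array([1, -1, -1, 1])
Sk = S0[:, perm] * signs + 0.05 * rng.standard_normal((1000, 4))
P, c = match_components(S0, Sk)
```

Each column of P contains a single entry of +1 or -1, and the vector `c` plays the role of the component-wise correlations $c_i^k$ defined above.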

Evaluation of component estimates

We resampled the full compendium randomly without replacement to obtain 50 similar-sized pseudo-representative compendia. Whitening was performed as described previously, but we selected the same number of components as the original solution to facilitate comparison between the models. For each of the 50 resampled compendia, we ran ICA ten times, and chose the solution with the highest negentropy score as the solution for that resampled compendium. We compared the 50 chosen solutions to the “best solution” from the representative compendium using the same methodology as in the previous section.

Biological annotations of components

GO terms and relationship to the H collection in MSigDB

For each component, we defined the sets of genes with loadings that were three standard deviations above or below the mean as the up or down modules respectively for the component. Collectively, we term the union of both sets of genes the active genes for the component. As per Engreitz et al. [19], we performed GO enrichment analysis using TopGO [36] on the up and down modules separately. The percentage overlap between gene signatures from the H collection of MSigDB [3] and the active genes for each FC was calculated. For each gene signature-FC pair, we also checked for enrichment of overlapped genes by performing a hypergeometric test; only pairs that had a BH-corrected p-value of less than 0.01 were retained.
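To make the module definition and the overlap test concrete, here is a small sketch. The helper names `up_down_modules` and `hypergeom_upper_tail`, the toy loadings and the example counts are hypothetical; the upper-tail form of the hypergeometric test is an assumption about how enrichment was assessed.

```python
import math
import numpy as np

def up_down_modules(loadings, n_sd=3.0):
    """Genes whose loading lies more than n_sd standard deviations from the mean."""
    mu, sd = loadings.mean(), loadings.std(ddof=1)
    up = set(np.flatnonzero(loadings > mu + n_sd * sd).tolist())
    down = set(np.flatnonzero(loadings < mu - n_sd * sd).tolist())
    return up, down

def hypergeom_upper_tail(k, M, n, N):
    """P(overlap >= k) when N active genes are drawn from M genes, n of them in the signature."""
    total = math.comb(M, N)
    tail = sum(math.comb(n, x) * math.comb(M - n, N - x) for x in range(k, min(n, N) + 1))
    return tail / total

# Toy loadings: mostly noise, with five strongly positive and five strongly negative genes.
rng = np.random.default_rng(0)
loadings = rng.standard_normal(2000)
loadings[:5], loadings[5:10] = 8.0, -8.0
up, down = up_down_modules(loadings)
active = up | down

# Hypothetical overlap: 20 of 40 active genes fall in a 50-gene signature (1000 genes total).
p_value = hypergeom_upper_tail(k=20, M=1000, n=50, N=40)
percent_overlap = 100.0 * 20 / 50
```

In a full analysis the p-values for all signature-FC pairs would then be BH-corrected before applying the 0.01 threshold.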

Fingerprinting human tissues: GSE3526 and GSE7307

All 353 normal human samples from GSE3526, coming from 65 different tissue types derived from ten post-mortem donors, were downloaded from GEO, processed and projected into FC space. To obtain representative samples from the 22 nervous system tissues, we calculated the pairwise distances within each tissue type and selected the medoid (sample with the minimum distance to all other samples within the same tissue type). Clustering of the 22 samples was then performed. The set of all samples from GSE3526 was also used as a compendium to annotate queries with their most similar tissue origin.

GSE7307 (Human Body Index) contains 677 samples from 90 tissue types, some of which were from diseased patients. We downloaded only the healthy samples, and processed them as per GSE3526. We compared the tissue types that were common to both GSE3526 and GSE7307 using the Pearson correlation coefficient. To provide robust estimates that were not affected by outlying samples in the tissue types, we reported the median and the standard deviation of the correlations for each “GSE/tissue”-“GSE/tissue” pair. We also included the sub-compendium (“Human Tissue Compendium”), containing both 353 samples from GSE3526 and 504 samples from GSE7307, in our R package so that users can also use it to annotate their query samples with the most probable tissue types.

FC applications and analysis

For all evaluation datasets, the raw CEL files were downloaded from GEO. Each GSE was processed independently by running RMA on the set of samples, followed by technical bias correction as per the earlier section “Data collection”. Projection of a dataset $Q_{gene}$ into the corresponding FC space is done via the matrix multiplication

$$Q_{FC} = S^T Q_{gene}$$

A unitary vector space based on the FC loadings can also be defined by normalizing each component to have


unit length, which we provide as an option in our R package. For all analysis in this paper, however, projection onto FC space was done using the original gene loadings in the calculated FCs.
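The projection into FC space is a single matrix product, which the following sketch illustrates. The function name `project_to_fc_space` and the toy matrices are hypothetical; the `unit_length` flag mirrors the optional normalization mentioned above and is not the interface of the authors' R package.

```python
import numpy as np

def project_to_fc_space(S, Q_gene, unit_length=False):
    """S: genes x FCs loading matrix; Q_gene: genes x samples expression matrix."""
    if unit_length:
        S = S / np.linalg.norm(S, axis=0, keepdims=True)  # each FC scaled to unit length
    return S.T @ Q_gene                                   # result: FCs x samples

rng = np.random.default_rng(0)
S = rng.standard_normal((100, 3))       # toy loadings: 100 genes, 3 FCs
Q_gene = rng.standard_normal((100, 4))  # toy query: 100 genes, 4 samples
Q_fc = project_to_fc_space(S, Q_gene)
Q_fc_unit = project_to_fc_space(S, Q_gene, unit_length=True)
```

The two variants differ only by a per-FC scale factor, so correlations between samples computed on individual FCs are unaffected, while distance-based analyses across FCs can change.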

Wherever a t-test was used, Benjamini-Hochberg correction was performed on the p-values [37], with N being either the total number of FCs (139) or the total number of genes (20,089) being tested. For heatmaps, the genes and arrays were clustered using hierarchical clustering with average linkage, and the distance metric for both was defined using the Pearson correlation:

$$\mathrm{Dist}_{i,j} = 1 - \mathrm{cor}(Q_i, Q_j)$$
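For readers unfamiliar with the two ingredients named above, the sketch below implements the Benjamini-Hochberg adjustment and the correlation distance in plain Python. The function names and the example p-values are illustrative assumptions; in practice the authors would have used standard R routines.

```python
import math

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def correlation_distance(x, y):
    """1 - Pearson correlation between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
d_same = correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
d_opposite = correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Here N is simply the length of the p-value list, so testing 139 FCs incurs a far milder correction than testing 20,089 genes.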

FCs as features for machine learning algorithms: GSE71370 and GSE13159

For GSE71370, the meta-data available in GEO was used to annotate the samples under three categories: synovial fluid from rheumatoid arthritis (RA) patients (RASFM), peripheral blood from RA patients (RAPBM) and peripheral blood from healthy patients (HCPBM). Gene expression data were projected into FC space. For each of the three pair-wise comparisons between categories, unpaired t-tests were performed across the FCs, with a BH-corrected p-value threshold of 0.05. The union of the three sets of differentially expressed FCs was then used as the signature to cluster the sample types. We performed hierarchical clustering based on the FC values in the signature, using average linkage. To identify FCs that were specific to the differences between RAPBM and RASFM, we focused on the DE FCs that were unique to the pair (i.e. not in common with DE FCs from the RASFM vs RAPBM analysis), and report the corresponding GO enrichment annotations.

GSE13159 contains data from the Microarray Innovations in Leukemia (MILE) study program, consisting of eighteen different categories of leukemia (including a control group). The class labels of the individual samples were obtained from the Data Supplement accompanying the original publication [38]. After preprocessing as described earlier, the data was projected into FC space using the unitary vector space. The classification results were obtained using the same methodology as the original authors, by applying support vector machine (SVM) classifiers in three independent runs using 30-fold cross validations. The R package kernlab [39] was used to implement the classifiers with a linear kernel function. We also defined the call rate (CR) similarly as the number of determinable calls. The sensitivity for each class was calculated as the fraction of correctly predicted samples in that class out of all determinable calls in the run. We report the mean CR and sensitivity across the three runs.

Performance of FC-based models in low sample settings

Samples from classes C3 (c-ALL/pre-B-ALL with t(9;22), 122 samples) and C8 (c-ALL/pre-B-ALL without t(9;22), 237 samples) in GSE13159 were defined as the positive and negative groups respectively. For a given simulation run, we randomly chose 22 C3 and 37 C8 samples as the held-out test set. The remaining data in the two groups (100 C3 and 200 C8) were then subsampled at 5%, 10%, 20%, 40%, 60% and 80% to produce corresponding training sets for training SVM classifiers (same parameters as the above analysis for GSE13159). For each subsampling percentage, we repeated the sampling 200 times. For a particular sampling x, we calculate the negative predictive value (NPV), positive predictive value (PPV, aka precision), sensitivity (aka recall), specificity and accuracy as follows:

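$$P = [\text{Class} = C3], \quad N = [\text{Class} = C8]$$

$$\text{Pred.P} = [\text{Predicted Class} = C3], \quad \text{Pred.N} = [\text{Predicted Class} = C8]$$

$$TP_x = \#[P \cap \text{Pred.P}]_x, \quad TN_x = \#[N \cap \text{Pred.N}]_x$$

$$PPV_x = \frac{TP_x}{\#[\text{Pred.P}]_x}, \quad NPV_x = \frac{TN_x}{\#[\text{Pred.N}]_x}$$

$$\text{Sensitivity}_x = \frac{TP_x}{\#[P]}, \quad \text{Specificity}_x = \frac{TN_x}{\#[N]}, \quad \text{Accuracy}_x = \frac{TP_x + TN_x}{\#[P] + \#[N]}$$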

The averages across the 200 samplings for a given subsampling percentage were then recorded as the respective statistics for that run. A total of ten independent simulation runs were performed, and the mean and standard deviation for the statistics were reported across the ten runs.


For comparison, we also calculated the above statistics using the full remaining data (i.e. subsampling percentage is 100%) at each run. Repeats were not performed for this case.
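The five statistics defined in this section can be computed from the true and predicted labels of a held-out test set, as in the sketch below. The function name `binary_metrics` and the small label vectors are hypothetical, and returning NaN when a denominator is zero is a choice made here rather than something stated in the text.

```python
def binary_metrics(truth, predicted, positive="C3", negative="C8"):
    """PPV, NPV, sensitivity, specificity and accuracy for a two-class problem."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t == negative and p == negative for t, p in pairs)
    n_pos = sum(t == positive for t in truth)
    n_neg = sum(t == negative for t in truth)
    pred_pos = sum(p == positive for p in predicted)
    pred_neg = sum(p == negative for p in predicted)

    def ratio(num, den):
        return num / den if den else float("nan")  # undefined when nothing was predicted

    return {
        "PPV": ratio(tp, pred_pos),
        "NPV": ratio(tn, pred_neg),
        "sensitivity": ratio(tp, n_pos),
        "specificity": ratio(tn, n_neg),
        "accuracy": ratio(tp + tn, n_pos + n_neg),
    }

# Toy test set: 4 C3 and 6 C8 samples, with one error in each class.
truth = ["C3"] * 4 + ["C8"] * 6
predicted = ["C3", "C3", "C3", "C8"] + ["C8"] * 5 + ["C3"]
metrics = binary_metrics(truth, predicted)
```

Averaging such dictionaries over the 200 samplings of one subsampling percentage would give the per-run statistics described above.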

FCs retain biological information while regularizing data: MAQC and GSE15434

Affymetrix HG-U133 Plus 2.0 samples from the MicroArray Quality Control (MAQC) project were downloaded from GEO (GSE5350) and processed. The 120 samples were clustered in both FC space and full gene space, and cophenetic correlation between the trees was computed. For visualization purposes, a tanglegram [40] using both trees was also generated. For evaluation of the FC-based clustering tree, we grouped samples from A and C as a mega-class, and B and D as the other mega-class. The tree was cut to yield two clusters, and these were then classified as one of the two mega-classes based on the majority of the cluster membership. The purity of the clustering was calculated as

$$\text{Purity} = \frac{1}{120} \sum_{i=1}^{2} \#\{\text{Correctly classified samples in } C_i\}$$

where $C_i$ is the i-th cluster. The Gini impurity for each of the two clusters was calculated as

Fig. 1 Sizes of full and representative compendium as a function of time. The number of arrays in both compendia is plotted here, as a function of the year. In 2015, there were 97,049 arrays in the full compendium and 2726 in the representative compendium.

Fig. 2 Variance explained by eigenvectors from SVD. (Left) Percentage of variance explained by each eigenvector. (Right) Cumulative variance explained by the eigenvectors. The blue lines represent the cumulative variance explained by the leading 139 eigenvectors.


$$\text{Gini Impurity}(C_i) = 1 - \sum_{j=1}^{2} f_j^2$$

where $f_j$ is the fraction of samples in the i-th cluster that are from the j-th mega-class.
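A short worked example may help in reading the purity and Gini impurity definitions. The function names and the eight-sample toy labelling are hypothetical; the mega-class names "AC" and "BD" simply echo the grouping described in the text.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    correct = 0
    for cluster in set(cluster_labels):
        members = [k for c, k in zip(cluster_labels, class_labels) if c == cluster]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(class_labels)

def gini_impurity(cluster_labels, class_labels, cluster):
    """1 minus the sum of squared class fractions within one cluster."""
    members = [k for c, k in zip(cluster_labels, class_labels) if c == cluster]
    counts = Counter(members)
    return 1.0 - sum((n / len(members)) ** 2 for n in counts.values())

# Toy example: two clusters of four samples, each with one misplaced sample.
clusters = [1, 1, 1, 1, 2, 2, 2, 2]
classes = ["AC", "AC", "AC", "BD", "BD", "BD", "BD", "AC"]
toy_purity = purity(clusters, classes)            # (3 + 3) / 8
toy_gini = gini_impurity(clusters, classes, 1)    # 1 - (0.75**2 + 0.25**2)
```

A perfectly separated clustering would give a purity of 1 and a Gini impurity of 0 for both clusters.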

GSE15434 contains a total of 251 AML samples, coming from three different centers in Germany: Dresden (DRE), Munich (MUC) and Ulm (ULM), with 78, 96 and 77 samples respectively. Approximately half of the samples contained mutations in the NPM1 gene. We identified differentially expressed (DE) functional components (FCs) and genes between the NPM1-mutated and NPM1-wild type groups using the R package limma [41] at a false discovery rate threshold of 1%, and compared the number of shared DE FCs/genes between the three test centers. We also performed a typical gene set enrichment analysis [3] using the 4725 curated gene sets in the C2 collection of MSigDB v5.0, using the recommended parameters of 1000 phenotype permutations and a false discovery rate (FDR) of 25%. Significant gene sets were identified for both NPM1-mutated and NPM1-wild type groups.

Differentially expressed FCs are biologically relevant: GSE66533 and E-MTAB-3162

For GSE66533, the rhabdomyosarcoma samples were separated into two main groups (33 PAX3-FOXO1 Fusion-Positive and 25 Fusion-Negative samples) based on descriptions obtained from Supplementary 1 of the paper by Sun et al. [42]. Gene expression data were projected into FC space, and unpaired t-tests were performed across the FCs to identify DE FCs. To perform a search for similar samples, we calculated the Pearson correlation coefficient in FC space between samples from the study and the full compendium. For each sample in either group, we retained all GSMs from the full compendium that had a correlation of more than 0.95, and term these “neighbors”. We then took the union of these “neighbors” within a group, and removed GSMs that were not considered “neighbors” to at least half of the group’s members. To identify GSMs that were unique to either group, we focus on the set-difference

Fig. 3 Correlation of FCs between the best solution and each of the 99 other runs. The yellow dots and the black lines are the mean and standard error of the mean Pearson correlation coefficient values respectively, for each FC. The empty blue circles are the maximum coefficient recorded for the FCs.

Fig. 4 Promiscuity of significant genes in FCs. Significant genes for each FC were pooled together and tabulated. The histogram shows the distribution of how frequently a gene is found to be significant in one or more FCs. The x-axis is the number of FCs in which a particular gene is found to be significant, and the y-axis is the number of unique genes that meet the corresponding requirement. For instance, the maximum number of FCs that a gene was found to be significant in was 44, with only one gene meeting that criterion (AKR1C3; Entrez ID 8644). This is observed in the histogram at the 44th position on the x-axis, with a height corresponding to 1 (represented as a dot due to the scale). A total of 9091 genes were found to be significant in only one to three FCs.


between the two sets of “neighbors”. We also applied our “Human Tissue Compendium” to identify the tissue types most closely associated with the samples.
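The neighbor search described for GSE66533 can be sketched as follows. The function name `group_neighbors`, the toy FC-space matrices and the use of column indices in place of GSM accessions are assumptions for illustration only.

```python
import numpy as np

def group_neighbors(group_fc, compendium_fc, threshold=0.95, min_fraction=0.5):
    """Compendium columns correlated above threshold with at least half of the group.

    group_fc: FCs x group samples; compendium_fc: FCs x compendium samples.
    """
    n_group = group_fc.shape[1]
    corr = np.corrcoef(group_fc.T, compendium_fc.T)[:n_group, n_group:]
    support = (corr > threshold).sum(axis=0)  # number of group members each GSM neighbors
    return set(np.flatnonzero(support >= min_fraction * n_group).tolist())

def noisy_copies(profile, n, rng):
    """n slightly perturbed copies of one FC-space profile, as columns."""
    return profile[:, None] + 0.05 * rng.standard_normal((profile.size, n))

rng = np.random.default_rng(0)
a, b = rng.standard_normal(50), rng.standard_normal(50)
compendium = np.hstack(
    [noisy_copies(a, 2, rng), noisy_copies(b, 2, rng), rng.standard_normal((50, 2))]
)
neighbors_a = group_neighbors(noisy_copies(a, 5, rng), compendium)
neighbors_b = group_neighbors(noisy_copies(b, 5, rng), compendium)
unique_a = neighbors_a - neighbors_b  # set-difference between the two neighbor sets
```

In the toy data, compendium columns 0 and 1 resemble group A and columns 2 and 3 resemble group B, so each group recovers its own two neighbors and the set-difference leaves them unchanged.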

For E-MTAB-3162, the raw CEL files were downloaded from ArrayExpress [43] and processed. The meta-data was obtained from the sdrf file, and used to divide the dengue patient samples into the two subgroups (Day 0 vs Day 4). We performed t-tests to identify the set of DE FCs. To map the GO annotations to GO slim terms, we used the Map2Slim tool [44] from the Gene Ontology project, with the go-basic ontology and the default goslim_generic subset.

Results

Representative compendium and parallel analysis

To avoid overrepresentation of any biological phenotype in the training data, clustering of microarray samples was performed on the full compendium (97,049 arrays) to obtain a representative compendium. The height cut-off of the clustering tree was determined to be 0.3 based on k-nearest neighbor plots for k = 4 and 5 (see Additional file 1: Figure S1). After the filtering step described in the methods section, we obtained a representative compendium consisting of 2726 samples. The clustering and filtering process was found to be robust against varying sizes of the full compendium, and scaled closely with the latter (Fig. 1). 86.4% of the samples have between two and nine unique MeSH terms coming from the four MeSH categories (Additional file 1: Table S1). The MeSH annotations of the representative compendium (Additional file 1: Table S1) indicate that about a third of the samples were cancer-related (MeSH term: C04, neoplasms), with a substantial number of representatives from other pathological conditions and diseases found in skin and immune system. There are also representatives from all major anatomical classes (MeSH terms: A0-A9).

After whitening of the representative compendium, parallel analysis suggested that only the leading 139 components should be retained. We note that the number of retained components was the same for both implementations of parallel analysis, using either the median (Horn’s method [31]) or the 95th percentile (Glorfeld’s method [32]) for determining bias. Collectively, the 139 components of the whitened data explained close to 80% of the total variance in the representative compendium (Fig. 2). This whitened and reduced matrix (20,089 × 139) was then used for subsequent ICA processing.

ICA and evaluation of component estimates

On average, each of the 100 independent runs took 2461 iterations to reach the convergence requirement. The final negentropy of the ICA solutions ranged from 0.2417 to 0.2443, with a median of 0.2442. Run 39’s solution yielded the highest negentropy and was thus used as the “best solution” in the rest of the paper. The 139 columns of the canonized S matrix are the independent components

Fig 5 Relationship between FCs and H collection from MSigDB Overlaps between the active genes and gene signatures from the H collection were filtered for statistical significance and then presented as a percentage of the total number of genes in each signature


obtained from ICA, and we refer to them as functional components (FCs). Each FC had zero mean and unit standard deviation.

The derived FCs were well correlated between all 100 runs (Fig. 3), and for the majority of the FCs, the mean Pearson correlation coefficient was more than 0.8, with the maximum being close to 1 for all the FCs. Similar results were observed when Spearman correlation was used. We note that the leading 25 FCs of our chosen solution were also highly reproducible in the compendium subsampling analysis, with median Pearson correlation coefficients of more than 0.8. However, the correlation coefficients yielded by the subsampling analysis were uniformly lower than the ones observed in Fig. 3 across the FCs, and more so for the trailing FCs. In particular, FCs 65 to 139 had a maximum Pearson correlation coefficient of less than 0.8 in the subsampling analysis.

Biological interpretation of FCs

To gain a better understanding of the FCs, we identified the key gene contributors to each of them. The elements of each component are the gene loadings, and can be interpreted as the level of contribution of a gene to the component’s score. For a given FC, we consider the set of genes whose absolute loading is more than three standard deviations above the mean as active genes. Apart from FC 1, which only had 28 active genes, the number of active genes in the other FCs ranged from 103 to 494, with a median of 382. Amongst the 20,089 genes, 12,978 genes were found to be active in at least one FC. The majority of the genes were active in at most three FCs (Fig. 4), and the maximum number of components that a gene was observed to be active in was 44.
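The active-gene rule can be written down in a few lines. The sketch below reads the rule as "absolute loading greater than the mean loading plus three standard deviations", which is one interpretation of the text; the gene names and loadings are hypothetical toy values.

```python
from collections import Counter
from statistics import fmean, pstdev


def active_genes(loadings, n_sd=3.0):
    """Return the genes whose absolute loading exceeds mean + n_sd * SD.

    loadings: dict mapping gene name -> loading for a single component.
    Assumption: mean and (population) SD are taken over that component's loadings.
    """
    values = list(loadings.values())
    cutoff = fmean(values) + n_sd * pstdev(values)
    return {gene for gene, value in loadings.items() if abs(value) > cutoff}


def activity_counts(components):
    """Count, for every gene, the number of components in which it is active."""
    counts = Counter()
    for loadings in components:
        counts.update(active_genes(loadings))
    return counts


# Toy data: 20 background genes with small loadings plus one strong gene.
background = {f"gene_{i}": (0.1 if i % 2 == 0 else -0.1) for i in range(20)}
component_1 = {**background, "gene_A": 5.0}
component_2 = {**background, "gene_A": -5.0}
counts = activity_counts([component_1, component_2])
```

Because each FC is standardized, the cutoff is close to an absolute loading of 3 when the number of genes is large.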

The active genes for each FC were then used to obtain GO annotations for the corresponding FC. Of the 139 FCs, 22 did not have any GO annotations, and a further 14 had only one GO annotation. The largest number of GO annotations belonging to an FC was 58 (FCs 3 and 4). A total of 689 unique GO codes were obtained across the 139 FCs, a 66% increase compared to the 415 unique GO codes obtained from the corresponding 139 leading principal components. This suggests that there is more biological signal in the FCs than in components obtained via PCA, in line with current literature [13]. The GO annotations for some of the FCs are presented in this paper as part of the reanalysis of other gene expression studies; the complete set of GO annotations for the FCs can be found in our R package, humanFC.

The percentage gene overlap between active genes in the FCs and the respective gene signatures in the H collection of MSigDB was calculated, and only the statistically significant pairs are shown in Fig. 5. The highest overlap (75%) occurs between FC 2 and the H collection signature “INTERFERON_ALPHA_RESPONSE”, which contains 97 genes. Half of the signatures in the H collection contain 200 genes each, so even a pair with 50% gene overlap in Fig. 5 can indicate up to 100 shared genes. For instance, FC 10 and the gene signature “HEME_METABOLISM” have only a 52.5% overlap, but the actual number of shared genes is 105. In particular, FC 10 has five GO annotations (GO:0006782, GO:0051597, GO:0015701, GO:0006879 and GO:0048821) that are all related to heme metabolism, supporting a strong relationship with the namesake gene signature.
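The overlap percentage is the number of shared genes divided by the size of the signature, so 105 shared genes out of 200 give 52.5%. The sketch below reproduces that arithmetic and adds a hypergeometric upper-tail probability as one possible significance filter. The choice of test is an assumption, since this section does not name the test that was used, and the gene identifiers are placeholder integers.

```python
from math import comb


def overlap_percent(active, signature):
    """Shared genes as a percentage of the signature size."""
    return 100.0 * len(active & signature) / len(signature)


def hypergeom_upper_tail(shared, n_active, n_signature, n_universe):
    """P(X >= shared) when n_active genes are drawn without replacement from
    a universe of n_universe genes, n_signature of which are in the signature.

    Assumption: a hypergeometric model; the source does not state its test.
    """
    total = comb(n_universe, n_active)
    upper = min(n_active, n_signature)
    tail = sum(
        comb(n_signature, k) * comb(n_universe - n_signature, n_active - k)
        for k in range(shared, upper + 1)
    )
    return tail / total


# Placeholder gene IDs arranged so that 105 of the 200 signature genes are shared.
signature = set(range(200))
active = set(range(95, 400))
percent = overlap_percent(active, signature)

# Small toy case that can be checked by hand: (6 * 6 + 4 * 1) / 120 = 1/3.
p_toy = hypergeom_upper_tail(shared=2, n_active=3, n_signature=4, n_universe=10)
```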

Table 1 Active Genes for FC 1

Twenty of our FCs do not have any significant gene overlap with the signatures in the H collection. Of these, fourteen of them (FC 1, 36, 51, 59, 60, 90, 94, 99, 102, 103, 112, 118, 123, 128 and 137) also do not have any GO annotations. The lack of GO annotations for these FCs does not necessarily indicate a lack of biological significance; for instance, the active genes in FC 1 are clearly markers for sex-specific features (Table 1). Insight into the characteristics of these FCs can also be obtained by looking at the tissue samples from microarray experiments that have the highest or lowest scores in those FCs. In the case of FC 36, the ten lowest scoring samples were mostly from myeloma cells, whereas the highest scoring samples were from normal epithelia.
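Inspecting the extremes of an FC amounts to ranking samples by their score in that component. A minimal sketch, assuming the scores are available as a mapping from sample identifier to score (the identifiers and values below are invented):

```python
def extreme_samples(scores, k=10):
    """Return (k lowest scoring samples, k highest scoring samples).

    scores: dict mapping sample identifier -> score in one FC.
    The highest scoring samples are listed from highest to lowest.
    """
    ranked = sorted(scores, key=scores.get)
    return ranked[:k], ranked[-k:][::-1]


# Hypothetical sample identifiers and scores.
scores = {"s1": -2.0, "s2": 0.5, "s3": 3.1, "s4": -0.7}
lowest, highest = extreme_samples(scores, k=2)
```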

Fingerprinting human tissues

We built a database of tissue fingerprints so that it could be used to annotate future samples. In order to avoid fitting to errors from a single study, we compared the fingerprints from two relevant tissue studies (GSE3526 and GSE7307) with each other.

About a third of the samples from GSE3526 were from 22 tissue types belonging to the nervous system, and we performed clustering of the representatives from these tissues (Fig. 6). The clustering displayed underlying anatomical and physiological similarities between the tissues. For instance, the tissues from the three lobes (parietal, occipital and temporal) were grouped together with the cerebral cortex in one major cluster, whereas the other cluster was enriched for tissues from the peripheral nervous system, such as ganglia tissues (trigeminal, dorsal root) and the spinal cord, and most members of the basal ganglia (substantia nigra, subthalamic nucleus, ventral tegmental area).

There are a total of 65 tissue types that were common to both GSE3526 and GSE7307 based on the annotations in GEO (the tissue types in the former are a proper subset of those in the latter). The median Pearson correlation coefficients (MPC) between tissues from GSE3526 and GSE7307 are shown in Fig. 7. Tissues from the same classes (diagonal of Fig. 7) were highly correlated, with an average MPC of 0.985 and an interquartile range of 0.981 to 0.990. The mean standard deviation across the whole MPC matrix was 0.0269, with an interquartile range of 0.0157 to 0.0346.
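An MPC entry can be computed by correlating every sample of one tissue in the first study with every sample of a tissue in the second study and taking the median. The following is a minimal sketch of that calculation with invented score vectors; the function names are hypothetical.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length vectors with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def median_pairwise_correlation(group_a, group_b):
    """Median Pearson correlation over all cross-group sample pairs."""
    return median(pearson(a, b) for a in group_a for b in group_b)


# Toy FC-space profiles: two samples per tissue in each "study".
study1_tissue = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
study2_same_tissue = [[10.0, 20.0, 30.0], [3.0, 6.0, 9.0]]
study2_other_tissue = [[3.0, 2.0, 1.0]]
mpc_same = median_pairwise_correlation(study1_tissue, study2_same_tissue)
mpc_other = median_pairwise_correlation(study1_tissue, study2_other_tissue)
```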

FCs as features for machine learning algorithms

To demonstrate the applicability of our FCs as features for use in machine learning algorithms, we applied our FCs to two different studies (rheumatoid arthritis and leukemia). Additionally, we performed subsampling of the leukemia study to compare how model performance in FC space and full gene space is affected in low-sample settings.

GSE71370 (rheumatoid arthritis)

GSE71370 contains three sample types: peripheral blood from rheumatoid arthritis (RA) patients (RAPBM), peripheral blood from healthy controls (HCPBM), and synovial fluid from RA patients (RASFM). Using the standard Affymetrix chip definition file (CDF), we found 6636 differentially expressed (DE) genes between RASFM and HCPBM, and zero DE genes between RAPBM and HCPBM.

Fig. 6 Representative tissue samples from the nervous system (GSE3526). Medoid samples from each tissue type were selected to be the representative for the tissue. The representatives were then clustered using hierarchical clustering with average linkage. The full range of samples can be found in Additional file 1: Figure S3 (no clustering performed, but the order of the tissue types is the same as here). The order of the FCs in the plot can be found in Additional file 1: Text S1.

There are zero differentially expressed (DE) FCs between RAPBM and HCPBM, 72 DE FCs between RAPBM and RASFM, and 89 DE FCs between RASFM and HCPBM. Of these, 61 FCs were common to the latter two sets, resulting in a combined signature of 100 FCs for clustering. Figure 8 shows the clustering results using the signature. A distinct separation between the classes is observed, and the two subgroups from the same tissue type (peripheral blood) are clustered together. There are eleven DE FCs that are unique to the comparison between RASFM and RAPBM. Additional file 1: Figure S4 shows the corresponding clustering results in gene space, and Additional file 1: Table S3 lists the FCs and the corresponding GO annotations. In total, there were 75 unique GO terms that were associated with the selected FCs.
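The counts above follow from simple set arithmetic: 72 + 89 - 61 = 100 FCs in the union, and 72 - 61 = 11 FCs unique to the RAPBM versus RASFM comparison. The sketch below reproduces these numbers with placeholder indices; the integers are chosen only to give the reported set sizes and are not the actual FC identifiers.

```python
def combine_signatures(de_first, de_second):
    """Return (union, intersection, members unique to the first set)."""
    return de_first | de_second, de_first & de_second, de_first - de_second


# Placeholder indices, NOT the real FC numbers: 72 and 89 members, 61 shared.
de_rapbm_vs_rasfm = set(range(1, 73))
de_rasfm_vs_hcpbm = set(range(12, 101))
signature, shared, unique_to_first = combine_signatures(
    de_rapbm_vs_rasfm, de_rasfm_vs_hcpbm
)
```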

GSE13159 (leukemia)

Fig. 7 Tissue sample correlations between GSE3526 and GSE7307. Pairwise Pearson correlation coefficients were calculated (in FC space) between samples from GSE3526 and GSE7307. The figure shows the median correlation scores for each GSE/tissue-GSE/tissue pair. High correlation is observed between anatomically-related tissues.

GSE13159 contains patient samples from 18 different classes of leukemia. Table 2 shows the confusion matrix of the SVM classification model using data that was projected into our FC space, and Fig. 9 summarizes the average differences between our confusion matrix and that from the original paper (Table 2 in Haferlach et al. [38]) after normalizing for class size. The call rates (CR) achieved by both models are very similar, although the class-wise sensitivity of the model from Haferlach et al. was generally slightly better, averaging 0.0692 higher than that of our model. For half of the 18 classes, the differences between the sensitivities from the two models were insignificant (the median difference is 0.0575), and for class C15, our FC-based model outperformed Haferlach’s model marginally. The misclassification patterns (off-diagonals of the confusion matrices) were similar between both models, although our FC-model misclassified samples as C8 or C13 more frequently.
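Class-wise sensitivity is obtained from a confusion matrix by dividing each diagonal entry by its row total, which is also what normalizing for class size amounts to. The sketch below assumes that rows hold the true classes and columns the predicted classes; the two small matrices are toy data, not the values from Table 2 or from Haferlach et al.

```python
def classwise_sensitivity(confusion):
    """Sensitivity per class: correctly classified samples / class size.

    Assumption: confusion[i][j] is the number of class-i samples that were
    predicted as class j, and every class has at least one sample.
    """
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def mean_sensitivity_gap(reference, ours):
    """Average of (reference sensitivity - our sensitivity) across classes."""
    ref = classwise_sensitivity(reference)
    own = classwise_sensitivity(ours)
    return sum(r - o for r, o in zip(ref, own)) / len(ref)


# Toy two-class confusion matrices.
reference_model = [[9, 1], [1, 9]]
fc_model = [[8, 2], [1, 9]]
gap = mean_sensitivity_gap(reference_model, fc_model)
```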

The random forest we built indicated that FCs 18, 39 and 54 are the three most important variables (Fig. 10). The corresponding GO annotations for the three FCs (Table 3) are all related to immune response.

Performance of FC-based models in low sample settings

We subsampled two classes from the leukemia study at various fractions to create datasets of varying sizes. The FC space models had higher negative predictive value (NPV), sensitivity and accuracy than the full gene space models when the fraction of training data used was low (Fig. 11b, c and e). Specifically, we observed that the FC-based models had higher sensitivity for subsampling fractions of up to 20% of the full training size (300), and higher accuracy and NPV (p < 0.01) for subsampling fractions of up to 10%.
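With a full training size of 300, subsampling fractions of 10% and 20% correspond to 30 and 60 training samples. The sketch below shows how such subsets could be drawn and how the three reported metrics are defined from binary confusion counts. The sampling scheme (simple random sampling with a fixed seed) is an assumption, and the counts are invented.

```python
import random


def subsample(items, fraction, seed=0):
    """Draw a random subset containing the given fraction of the items.

    Assumption: simple random sampling without replacement.
    """
    k = max(1, round(fraction * len(items)))
    return random.Random(seed).sample(items, k)


def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, negative predictive value and accuracy from counts.

    No guard against zero denominators; the counts must make each ratio defined.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


training_ids = list(range(300))
small_set = subsample(training_ids, 0.10)
larger_set = subsample(training_ids, 0.20)
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```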

Fig. 8 Clustering of GSE71370 samples using 100 FCs. 100 FCs were identified to be DE between the pairwise classes, and were used to perform clustering on the samples. The three sample classes were separated very well by hierarchical clustering (average linkage), with only GSM1833142 appearing to be clustered incorrectly as RASFM. The order of the FCs in the heatmap can be found in Additional file 1: Text S1.
