Transcriptional regulation of protein complexes in yeast
Addresses: * Service de Conformation des Macromolécules Biologiques, Centre de Biologie Structurale et Bioinformatique, CP 263, Université
Libre de Bruxelles, Bld du Triomphe, B-1050 Bruxelles, Belgium † Institut Pasteur, Unité d'Expression des Gènes Eucaryotes, Institut Pasteur,
rue du Docteur Roux, 75724 Paris Cedex 15, France
Correspondence: Shoshana J Wodak E-mail: shosh@ucmb.ulb.ac.be
© 2004 Simonis et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all
media for any purpose, provided this notice is preserved along with the article's original URL.
Abstract
Background: Multiprotein complexes play an essential role in many cellular processes. But our
knowledge of the mechanism of their formation, regulation and lifetimes is very limited. We
investigated transcriptional regulation of protein complexes in yeast using two approaches. First,
known regulons, manually curated or identified by genome-wide screens, were mapped onto the
components of multiprotein complexes. The complexes comprised manually curated ones and
those characterized by high-throughput analyses. Second, putative regulatory sequence motifs
were identified in the upstream regions of the genes involved in individual complexes and regulons
were predicted on the basis of these motifs.
Results: Only a very small fraction of the analyzed complexes (5-6%) have subsets of their
components mapping onto known regulons. Likewise, regulatory motifs are detected in only about
8-15% of the complexes, and in those, about half of the components are on average part of
predicted regulons. In the manually curated complexes, the so-called 'permanent' assemblies have
a larger fraction of their components belonging to putative regulons than 'transient' complexes. For
the noisier set of complexes identified by high-throughput screens, valuable insights are obtained
into the function and regulation of individual genes.
Conclusions: A small fraction of the known multiprotein complexes in yeast seems to have at
least a subset of their components co-regulated on the transcriptional level. Preliminary analysis of
the regulatory motifs for these components suggests that the corresponding genes are likely to be
co-regulated either together or in smaller subgroups, indicating that transcriptionally regulated
modules might exist within complexes.
Background
Multiprotein complexes such as the ribosome, spliceosome,
cyclosome, proteasome and the nuclear pore complex have an
essential role in cellular processes [1-3]. Until recently,
information about the building blocks of specific complexes
has been rather selective, and the mechanisms underlying the
formation of these complexes, and their regulation, lifetimes
and degradation remain largely unknown.
Published: 30 April 2004
Genome Biology 2004, 5:R33
Received: 26 November 2003 Revised: 30 March 2004 Accepted: 6 April 2004
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2004/5/5/R33
One can surmise that the formation of multiprotein
complexes might be regulated at different levels, including
transcriptional regulation, post-translational modification and
degradation. In prokaryotes a significant proportion of the
genes that are co-regulated at the transcriptional level code
for proteins that interact physically. This proportion is even
higher for gene groups whose co-regulation is conserved in
different genomes [4]. In some multiprotein complexes in
bacteria, the individual components were reported to be
expressed 'as needed', in a time-dependent fashion related to
their role in the complex [5].
In eukaryotes, with studies mainly limited to yeast, gene-expression
profiles have been shown to correlate with protein function and
protein-protein interactions [6-8]. More particularly, genes
corresponding to components of multiprotein complexes
were found to exhibit correlated expression profiles,
especially for complexes that form over a wide range of cellular
conditions [8]. In contrast, the relationships between gene
expression and genome-scale two-hybrid interaction data
appear to be more tenuous [6,7,9].
Yeast is an ideal model system in which to investigate the
relations between protein interactions and gene
co-regulation. It is one of the few organisms in which many individual
protein complexes have been characterized by biochemical
and other methods, with results available in the
Comprehensive Yeast Genome Database (CYGD) [10]. In addition, two
independent studies recently characterized multiprotein
complexes in yeast by a large-scale experimental approach
involving tandem affinity purification and MS analysis (TAP
[11]) and high-throughput MS protein complex identification
(HMS, [12]). Each study identified several hundred
complexes, containing on average about eight and eleven
polypeptides, respectively. Many of these were shown to be
associated with known cellular processes.
Yeast has also served as a model for the analysis of gene
expression [13-15] and transcriptional regulation [16,17].
Information about the target genes of transcription factors
has been compiled in specialized databases such as
TRANSFAC [18,19], SCPD [19], YPD [20] and aMAZE [21,22]. Most
recently, the genes bound by 106 yeast transcription factors
were identified by a high-throughput approach [16],
producing for the first time a global view of the transcriptional
regulation network in this organism.
Here we investigate the transcriptional regulation of
multiprotein complexes in yeast. In particular we aimed at finding
out to what extent components of such complexes are
co-regulated. We first determined the overlap between known sets
of co-regulated genes in yeast and groups of genes coding for
components of individual multiprotein complexes. A set of
co-regulated genes is defined here as the group of target genes
of the same transcription factor, and is denoted a 'regulon'.
Two categories of regulons are considered: the manually curated
regulons stored in the databases, and the regulons defined by
the gene-factor associations identified in the high-throughput
analyses mentioned above [16]. The protein complexes examined
are those manually curated in databases and the two
datasets derived from the recent genome-scale analyses.
We then applied pattern-discovery algorithms [24,25] to the upstream sequences of genes coding for the proteins involved in each of the complexes in the three datasets under consideration. These algorithms are used to detect sequence patterns shared by some or all of these genes, which are likely to represent binding sites for transcription factors. These patterns take the form of short oligonucleotides (hexamers or pairs of trimers) that occur much more frequently in the upstream regions of these genes than in the corresponding regions across the entire yeast nuclear genome.
We have shown recently that these algorithms have the important advantage of returning predictions with a very small rate of false positives (over-represented patterns in groups of randomly selected genes) when stringent enough statistical criteria are used [26]. Alternative methods based on matrix descriptions [27-31] allow a more refined description of pattern degeneracy, in which a given sequence position need not be strictly conserved. But, unlike the approach used here, they have the inconvenience of nearly always returning a prediction, even for random sequences. This is particularly problematic when analyzing large groups of genes, of which a sizable proportion might not be regulated at the transcriptional level, or at least not by the same transcription factor, as might be the case for many of the protein complexes examined here.
Using the set of patterns detected for each complex, we proceeded to predict the components of the complex that are likely to be co-regulated. This is a difficult task, as the upstream regions of genes often contain multiple binding sites for the same factor or can be regulated by a combination of different factors that bind to distinct sites [32,33]. In addition, pattern-discovery algorithms generally return a number of strongly overlapping patterns for a given transcription factor, indicating the presence of a partial degeneracy [24,25]. Therefore, identifying sets of co-regulated genes usually involves assembling the patterns into longer motifs, and searching for upstream regions that score highly against these motifs, an approach that often yields ambiguous results. Here we use an alternative approach in which a discriminant analysis is performed directly on the detected short patterns and their multiple occurrences [26], thereby avoiding the difficult task of pattern assembly. This analysis is done for all the complexes considered and the results are discussed in terms of our current knowledge of these complexes and their regulation.
Table 1
Statistically significant associations between annotated complexes and known regulons
(a) Associations between the annotated complexes and annotated regulons
Columns: Annotated complex; Annotated regulon; Components in complex; Genes in regulon; Common genes; E-value; Total overlap
Fatty acid synthetase, cytoplasmic
Together, the approaches presented here provide valuable
insights into the transcriptional regulation of multiprotein
complexes in yeast and help in extracting information on
function from genome-scale datasets for these complexes.
Results
Correspondence between multiprotein complexes and known regulons
The genes coding for the components of each protein complex were compared with the genes of the known regulons, with the aim of detecting complex-regulon pairs where the overlap between the components is more extensive than would be expected by chance.
The analyzed datasets of complexes comprised 243 annotated protein complexes from CYGD [10] and 725 complexes identified by the high-throughput studies [11,12]. The complexes from the latter two studies were taken as defined by their authors, without further grouping [34]. The regulon datasets comprised the 200 annotated and the 106 high-throughput regulons.
(b) Associations between annotated complexes and high-throughput regulons, identified by a genome-wide location analysis [16]
Columns: Annotated complex; High-throughput regulon; Components in complex; Genes in regulon; Common genes; E-value; Total overlap
Permanent
Respiration chain complexes
Cytoplasmic ribosomes
Cytoplasmic ribosomal large subunit
Only the most statistically significant associations (E-value ≤ 0.01) between complexes and regulons are listed (see Additional data file 1 (Figure b) for a complete list). Each line lists the association detected between a multiprotein complex denoted by its CYGD name (column 2) and a regulon denoted by its common name (column 3). Column 4 lists the number of genes in the complex and column 5 lists the number of genes in the regulon. Column 6 lists the number of common genes between the regulon and complex, and column 7 lists the statistical significance criterion (E-value) for the detected overlap (see Materials and methods). The far right column lists the total number of genes in the complex that are common between it and all the regulons that map into it. Complexes have been subdivided into three categories, 'permanent', 'transient' or 'others', as indicated in column 1, and described in Materials and methods. When a smaller complex is completely included within a larger one and detected associations map into it, only the smaller complex is listed. For example, the larger assembly 'Cyclin-CDK complexes' is not listed because the detected association is with only one of its components, the 'Cdc28p complexes'. When associations are detected with more than one complex of a larger assembly, as is the case for the small and large subunits of the cytoplasmic ribosomes, the name of the larger assembly is given first, with no details of the identified associations; those are listed for each of the component complexes. Information on the annotated regulons in (a) was obtained from the TRANSFAC and aMAZE databases, from the list compiled by Young and colleagues [16,48] and from the recent literature.
To determine whether the number of common components
for a given complex-regulon pair is above chance level, or
statistically significant, we compute the expectation value
(E-value) of observing at least that number by chance, and retain
only pairs with an E-value below a certain threshold (see
Materials and methods).
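As an illustration of the overlap test described above, the following minimal sketch computes a P-value and an E-value for a single complex-regulon pair. It assumes a hypergeometric null model and an E-value obtained by multiplying the P-value by the number of pairs tested; the exact procedure is the one given in Materials and methods and may differ. The function names, the genome size of 6,000 genes and the example gene counts are hypothetical.

```python
from math import comb


def overlap_p_value(n_genes: int, n_complex: int, n_regulon: int, n_common: int) -> float:
    """Probability of at least n_common shared genes under a hypergeometric null.

    The complex is treated as n_complex genes drawn without replacement from
    n_genes, of which n_regulon belong to the regulon (an assumed null model).
    """
    max_common = min(n_complex, n_regulon)
    total = comb(n_genes, n_complex)
    tail = sum(
        comb(n_regulon, k) * comb(n_genes - n_regulon, n_complex - k)
        for k in range(n_common, max_common + 1)
    )
    return tail / total


def overlap_e_value(p_value: float, n_pairs_tested: int) -> float:
    """Expected number of pairs reaching this overlap by chance (assumed P x N)."""
    return p_value * n_pairs_tested


# Hypothetical example: an 8-gene complex sharing 7 genes with a 10-gene regulon.
N_GENES = 6000        # assumed approximate number of yeast genes
N_PAIRS = 243 * 200   # annotated complexes x annotated regulons (assumed multiplier)
p = overlap_p_value(N_GENES, 8, 10, 7)
e = overlap_e_value(p, N_PAIRS)
significant = e <= 0.01  # threshold quoted for Table 1
```

Exact integer binomial coefficients are used so that very small tail probabilities are not lost to rounding before the final division.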
Correspondence between regulons and annotated protein complexes
Table 1 lists the complex-regulon pairs whose overlap is above
chance level (E-value ≤ 0.01), obtained when mapping the
annotated complexes onto the annotated (Table 1a) and
high-throughput (Table 1b) regulons, respectively. It is striking to
see that the 243 annotated complexes and 306 known
regulons form a total of only 57 pairs with a statistically significant
overlap. Forty of those are with the annotated regulons, and
the remaining ones (only 17 in total) are with the
high-throughput regulons. Those pairs involve only about 8% of
complexes (20 out of 243) and 14% of the regulons (44 out of
306). The overlap between known regulons and annotated
complexes is thus on the whole quite limited.
Relating protein complexes to gene-expression data, Jansen
et al. [7] found it useful to distinguish between two major
categories of complexes. 'Permanent' complexes are defined as
those that are detected under a wide range of different cellular
conditions, whereas 'transient' ones are defined as complexes
that form under a specific set of conditions. While keeping in
mind that this division is probably oversimplified and could
sometimes be misleading, we follow these authors in
considering it a helpful working hypothesis. The list of complexes in
each category was derived from Jansen et al. [7] with some
editing. We classified complexes that did not clearly fit either
of the first two categories, and some larger assemblies
composed of several complexes, as 'other'.
Table 1 reveals that meaningful overlaps between complexes
and known regulons occur for both permanent and
non-permanent complexes. Associations with the annotated regulons
involve fewer complexes of the permanent category than of
non-permanent ones (Table 1a). In contrast, the associations
with the high-throughput regulons involve more permanent
complexes than transient ones (Table 1b), in better agreement
with the reported stronger relations of permanent versus
transient complexes with mRNA expression profiles [7].
Another interesting observation is that the set of complexes
into which regulons map and the extent of overlap between
complexes and regulons are also quite different for the
annotated and high-throughput regulon datasets. Regulons from
both datasets map into the
nucleosomal protein complex, ribonucleoside diphosphate
reductase and fatty-acid synthetase. On the other hand,
complexes such as the proteasome, the Cdc28p cyclins and RNA
polymerase II are only involved in associations with
annotated regulons (Table 1a), whereas the ribosomal subunits or
cytochrome c oxidase complexes are only involved in
associations with high-throughput regulons (Table 1b).
These and other differences are most likely to be due to the different composition of the regulon repertoires in the two datasets. The annotated dataset contains nearly twice as many regulons as the high-throughput one. But the regulons in the latter dataset are significantly larger, with on average six times more genes than in the annotated regulons (see Materials and methods). It is therefore not too surprising that for associations involving high-throughput regulons, the fraction of the components of a given complex covered by a regulon is in general higher than for annotated regulons. It should at the same time be cautioned that the high-throughput regulons probably contain a fair number of spurious members (false positives) [26].
Zoom-in on the overlaps between regulons and annotated complexes
We see that a complex is often associated with several regulons. This is due in part to the substantial overlap that often exists between the components of individual regulons. The most severe cases occur when different transcription factors are annotated as regulating the exact same set of genes, a situation that is often encountered for small regulons, and that probably arises either from incomplete information or because some transcription factors act in combination or as complexes [35]. We see for example that seven regulons map into the nucleosomal protein complex, six map into the ribonucleoside diphosphate reductase complex, and as many as 10 regulons map into the modular Cdc28p cyclin complexes (Table 1a).
A given regulon also maps, in general, into more than one complex, often onto two, and occasionally onto three. These multiple associations form a patchy network, with several disconnected clusters, which link complexes to regulons. The network graphs built from the associations of the annotated complexes, with annotated and high-throughput regulons, respectively, are illustrated in Additional data file 1 (Figures S1 and S2).
Details of some of these clusters are illustrated in Figures 1 and 2, highlighting the common genes involved. The nucleosomal protein complex (Figure 1a) has seven out of its eight components in common with seven small regulons - Hta1/Hta2, Spt10/Spt21 and Hir1/Hir2/Hir3 - whose genes partially overlap one another. The ribonucleoside diphosphate reductase complex (Figure 1b) has all its four components in common with a total of six partially overlapping regulons.
The picture is significantly more complicated for the cyclin-Cdc28p complexes (Figure 1c). As many as 10 regulons map into the 10 components of this complex. The Cln1 and Cln2 genes are regulated by as many as five different transcription factors, and the regulons of two transcription factors, Swi4 and Mcm1, also map into the glucan synthases and pre-replication complex, respectively.
Correspondence between regulons and high-throughput protein complexes
The total number of statistically significant overlaps (E-value
≤ 0.01) is also very low (66 in total) when the known regulons
are mapped onto TAP complexes and HMS complexes, even
though the number of complexes considered is much larger
(725).
The majority of the complex-regulon pairs with meaningful
overlap (53) involve annotated regulons, whereas only 13
pairs involve high-throughput regulons. Matches with
regulons from either dataset generally involve only a very small
subset of the complex components, and there are twice as
many matches with complexes from the HMS than from the
TAP datasets, in line with the larger size of the former dataset
(for a complete list of associations, see Additional data file 2
(Table S2)).
Owing to the appreciable overlap between the components of
different complexes within and between the TAP and HMS
datasets, the network of associations between these
complexes and the regulons is much more intricate than for the
annotated complexes. A network graph was built from the
larger set of 125 complex-regulon pairs with meaningful
overlaps (E-value ≤ 0.1) involving the annotated regulons (Figure
3). This network features seven separate dense clusters of
connections (Figure 3a-g). Details of the regulon-complex
overlaps in some of these clusters, highlighting the common
genes involved, are depicted in Figure 4a-c. The remaining
clusters are detailed in Additional data file 1 (Figure S3). In
Figure 4h the set of remaining very small clusters, each
involving mostly one or two connections, is grouped.
The first cluster (Figure 4a) corresponds chiefly to the overlap
between the Rpn4 regulon and 12 rather large complexes (six
TAP and six HMS complexes). Nine of the 11 genes of this
regulon map onto these complexes. All the complexes contain
components of the yeast proteasome, and some other
functionally related proteins in variable proportions. Interestingly,
six of the nine common genes correspond to proteins
from the 19S regulatory subunit, encoding four of the six
ATPases in the subunit (Rpt2, Rpt4, Rpt5, Rpt6). A further
two genes, PRE6 and PRE2, code, respectively, for alpha and
beta subunits of the catalytic domain [36], and another gene
(RAD23) encodes a ubiquitin-like protein, which links DNA
repair to the ubiquitin/proteasome pathway [37].
The second cluster (Figure 4b) involves four partially overlapping regulons of three genes each, totaling five genes. These genes map into three medium-sized complexes (6-16 genes) and one large complex of 40 genes, with no more than two to three genes mapping into the same complex. Here, too, the majority of the five genes correspond to a biologically active assembly - the ribonucleoside diphosphate reductase complex and associated kinase. The third cluster (Figure 4c) involves genes of the nucleosomal protein complex. A similar analysis can be made for the remaining four clusters (data not shown), and similar observations are made when analyzing the largest clusters in the network graph built from the 46 statistically significant overlapping pairs (E-value ≤ 0.1) involving the TAP and HMS complexes and high-throughput regulons (see Additional data files 1 and 2 (Figures S4, S5 and Table S2d, respectively)).
This detailed analysis shows that although the subset of the components of the multiprotein complexes that corresponds to known regulons is usually quite small, it tends to be composed of proteins with close physical interactions and/or clear functional relations. We also find that the bulk of the overlaps involve genes that map both into permanent complexes, such as the proteasome or the nucleosomal-protein complex, and into non-permanent ones, such as the ribonucleoside diphosphate reductase and the cyclin-Cdc28p complexes. No clear trends can therefore be identified from these data on the regulation of any one category of complexes in particular.
Figure 1
Detailed view of the main clusters in the network linking annotated protein complexes and regulons. The network (shown in Additional data file 1 (Figure S1)) was built from the multiple links corresponding to associations with E-value ≤ 0.1, identified between the 243 CYGD yeast multiprotein complexes and the 200 annotated regulons (see text). Ellipsoid frames represent complexes and rectangular frames represent regulons, with individual complexes and regulons appearing in different colors in a given cluster. Individual complexes are identified by their name in the CYGD complexes catalog [10] and regulons are denoted by the name of the bound transcription factor. Genes involved in complexes or regulons are enclosed, respectively, in rounded frames or rectangles of the same color as the complex or regulon, and are displayed by their common name. The two numbers given in parentheses indicate the number of genes involved in this cluster for the complex or regulon, and the total number of genes in the complex or regulon, respectively. (a) Cluster involving associations between three groups of regulons (Hta1-Hta2, Hir1-2-3, and Spt10-Spt21) and seven of the eight genes of the nucleosomal protein complex. (b) The ribonucleoside diphosphate reductase cluster, involving associations between all four genes of the corresponding complex and four groups of co-regulated genes belonging to six regulons. (c) Cluster involving associations between all the 10 components of the Cdc28p complexes, and seven distinct groups of genes belonging to 11 regulons. Five regulons - Cln3, Sit4, Spt16, Bck2, and Swi4 - map onto the exact same cyclin genes (CLN1, CLN2). Two regulons, Swi4 and Mcm1, also map into the glucan synthases and pre-replication complex, respectively.
The very limited overlap between complexes and regulons
detected above might be biologically meaningful, or might be
due to the limited information that is currently available on
the nature of protein complexes and regulatory networks in
yeast. Given these uncertainties, it seemed of interest to
complement the above analyses by an approach in which
regulons are directly predicted from the components of protein
complexes.
If the components of a given protein complex are co-regulated
on the transcriptional level, one would expect to find common
regulatory sequence elements, corresponding to
transcription factor binding sites, in the upstream regions of the
corresponding genes. The problem of identifying regulatory sites is
notoriously difficult [33]. To tackle it we applied algorithms
for the discovery of oligonucleotides (here, hexanucleotides)
[24] and spaced pairs of trinucleotides [25], which occur
more frequently in the upstream regions of the genes coding
for the components of each complex than in the
corresponding regions across the entire yeast nuclear genome. For this
approach we considered only complexes with at least five
components.
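The notion of over-representation used above can be sketched as follows. This is a simplified illustration and not the published algorithms [24,25]: it assumes a binomial model for word counts, background frequencies estimated from a reference set of upstream sequences with an add-one pseudocount, an E-value equal to the P-value multiplied by the number of possible words, and a significance index Sig = -log10(E-value), chosen so that Sig ≥ 1 corresponds to an E-value ≤ 0.1. All function names are hypothetical, and the sketch ignores strand symmetry, self-overlapping words and spaced pairs of trinucleotides.

```python
from collections import Counter
from math import exp, lgamma, log, log10


def count_words(sequences, k=6):
    """Count every k-letter word made only of A, C, G, T in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def binomial_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    total = 0.0
    for i in range(x, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)


def over_represented_words(group_seqs, background_seqs, k=6, min_sig=1.0):
    """Return (word, observed count, Sig) for words with Sig >= min_sig."""
    observed = count_words(group_seqs, k)
    background = count_words(background_seqs, k)
    n_positions = sum(observed.values())
    n_background = sum(background.values())
    n_words = 4 ** k  # number of possible words, used as the number of tests
    hits = []
    for word, count in observed.items():
        # Smoothed background frequency (assumption: add-one pseudocount).
        freq = (background[word] + 1) / (n_background + n_words)
        e_value = binomial_tail(n_positions, count, freq) * n_words
        sig = -log10(e_value) if e_value > 0 else float("inf")
        if sig >= min_sig:
            hits.append((word, count, sig))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

In this sketch, `group_seqs` would hold the upstream sequences of the genes of one complex and `background_seqs` the upstream sequences of all genes in the genome; raising `min_sig` corresponds to applying a more stringent reliability threshold.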
Highly significant patterns are detected for only a small subset of the complexes
Figure 5 plots the number and fraction of the protein
complexes in each of the three datasets examined (the annotated,
TAP and HMS complexes) for which regulatory-sequence
patterns were identified by our prediction method using three
different reliability thresholds. Plotted alongside are the
corresponding results obtained here for sets of randomly
selected genes (used as negative control) and results for
known regulons (positive control) obtained in another study
[26].
A first observation is that the fraction of complexes for which regulatory patterns are identified with some reliability is quite low. No more than 27-28% of the complexes from any of the three analyzed datasets have at least one pattern with statistical significance Sig ≥ 1 (corresponding to an E-value ≤ 0.1). At this threshold the fraction of complexes with identified patterns is nonetheless about 7-10% higher than for gene groups selected at random. With the more stringent significance threshold (Sig ≥ 2), the fraction of complexes with at least one pattern drops further, but less for the curated (15%) and TAP complexes (13%), than for the HMS complexes (8%).
We recently applied the same algorithms to the dataset of annotated regulons [26]. As the genes belonging to the same regulon are expected to be co-regulated and hence to exhibit common regulatory-sequence patterns, our algorithms should perform well on these genes. This was indeed the case. Patterns with Sig ≥ 1 were identified in as many as 84% of the annotated regulons, as illustrated in Figure 5.
The fraction of the complexes in which regulatory patterns can be reliably detected is thus clearly much smaller, confirming that the components of complexes are on average much less consistently co-regulated than the genes that belong to known regulons.
Assigning components of protein complexes to putative regulons on the basis of predicted patterns
Having shown that highly reliable regulatory patterns can be detected in genes corresponding to at least a fraction of the complexes, we proceeded to determine, for each complex, which of its components are likely to be co-regulated, and what fraction of the complex they represent. To this end, complexes with at least five component genes, featuring at least one significant pattern (Sig ≥ 1), are selected. A stepwise linear discriminant analysis [38] with a leave-one-out procedure is applied to assign a gene involved in a given complex, either to its original complex or to a control group of randomly selected genes, according to the number of occurrences of the discovered patterns in its upstream region. The assigned group (complex or control) is then compared to the group from which the gene was drawn to evaluate the coverage and positive predictive power (PPP) of the assignment. Coverage is defined as the fraction of the total number of genes in the complex that were reassigned to it by the discriminant procedure. PPP is defined as the fraction of the total number of genes assigned to the complex that actually belong to it (see Materials and methods for details).
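The two quality measures defined above can be written out as a short sketch. It only covers the final scoring step: the stepwise discriminant analysis and the leave-one-out loop are not reproduced, and the function name, the group labels and the example assignment are hypothetical.

```python
def coverage_and_ppp(true_groups, assigned_groups, target="complex"):
    """Return (coverage, PPP) for the genes whose true group is `target`.

    coverage = correctly reassigned complex genes / all genes in the complex
    PPP      = correctly reassigned complex genes / all genes assigned to the complex
    """
    if len(true_groups) != len(assigned_groups):
        raise ValueError("one assigned group is needed per gene")
    pairs = list(zip(true_groups, assigned_groups))
    in_complex = sum(1 for true, _ in pairs if true == target)
    assigned = sum(1 for _, given in pairs if given == target)
    correct = sum(1 for true, given in pairs if true == target and given == target)
    coverage = correct / in_complex if in_complex else 0.0
    ppp = correct / assigned if assigned else 0.0
    return coverage, ppp


# Hypothetical outcome: 10 complex genes and 10 control genes; the classifier
# returns 5 complex genes and 1 control gene as members of the complex.
true_groups = ["complex"] * 10 + ["control"] * 10
assigned_groups = (["complex"] * 5 + ["control"] * 5
                   + ["complex"] * 1 + ["control"] * 9)
coverage, ppp = coverage_and_ppp(true_groups, assigned_groups)
```

In this example half of the complex is recovered (coverage of 5/10), while five of the six genes assigned to the complex truly belong to it (PPP of 5/6).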
Figure 6 displays the coverage versus PPP values for a total of
140 individual complexes from the three datasets analyzed
Detailed view of the main clusters in the network linking annotated protein complexes and high-throughput regulons
Figure 2 (see previous page)
Detailed view of the main clusters in the network linking annotated protein complexes and high-throughput regulons The network was built considering
all the associations with E-value ≤ 0.1; regulons and complexes are denoted and depicted as described in the legend of Figure 1 (a) Cluster of associations
involving seven of the eight components of the nucleosomal protein complex Unlike in the equivalent cluster of Figure 1a, here only two distinct groups
of, respectively, two and six genes belonging to three rather large regulons (respectively, Met4 and Hir1-2) map into this complex Note that here Hir1-2
comprises a much larger group of genes than in the annotated regulons (b) Cluster of the respiratory chain complexes It comprises three complexes: the
F0-F1-ATP-synthase complex, and the cytochrome bc1- and cytochrome c oxidase complexes Twenty-five genes of the Hap4 regulon, and four and five
genes of the Hap3 and Hap2 regulons, respectively, map into these complexes As noted in the text, the Hap4 transcription factor is known as a
respiratory-chain activator that does not bind DNA but fosters DNA binding by Hap2 and Hap3 [45]) The reasons for the more limited overlap between
these latter two regulons and components of the respiration complexes are not clear (c) An interesting cluster where the main node is the large Mbp1
regulon of 112 genes, of which 10 overlap with components of three complexes: the small replication factor A complex (3 genes), the replication
complexes (49 genes) and the Cdc28p complexes (10 genes).
Figure 3 (see legend on next page)
here (34 TAP, 75 HMS, and 31 annotated ones). The coverage
obtained for these complexes has a mean value of 48%, and a
standard deviation of about 25%. The mean PPP is 80%, with
a standard deviation of about 10%, and only a single case with
perfect assignment (PPP = 100%). There is very little
difference between the results obtained for the annotated, TAP,
and HMS complexes (see Additional data file 2 (Table S3) for
details). It is noteworthy that significantly higher average
values for the coverage and PPP (72% and 92%, respectively)
were obtained by applying the same procedure to the
annotated regulons [26].
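The two quality measures quoted above can be made concrete with a short sketch. It assumes that coverage is the fraction of a complex's components that are recovered in the putative regulon, and that PPP is the fraction of predicted genes that are true components of the complex; these working definitions, the function name, and the gene labels are illustrative assumptions, and the exact definitions are those of the original procedure [26].

```python
def coverage_and_ppp(true_members, predicted_members):
    """Return (coverage, PPP) for one complex.

    coverage = true positives / number of true components
    PPP      = true positives / number of predicted genes
    (assumed definitions, see the accompanying text)
    """
    true_set = set(true_members)
    pred_set = set(predicted_members)
    true_positives = len(true_set & pred_set)
    coverage = true_positives / len(true_set) if true_set else 0.0
    ppp = true_positives / len(pred_set) if pred_set else 0.0
    return coverage, ppp


# Hypothetical complex of 10 genes; 6 genes predicted, 5 of them correct.
complex_genes = [f"g{i}" for i in range(1, 11)]
predicted = ["g1", "g2", "g3", "g4", "g5", "x1"]
coverage, ppp = coverage_and_ppp(complex_genes, predicted)
```

In this toy case half of the complex is recovered (coverage 0.5) while five of the six predictions are correct (PPP of about 0.83), which is the same qualitative pattern as the means reported above: moderate coverage with comparatively high PPP.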
Putative regulons in the annotated complexes
We determined whether the putative regulons identified by
our procedures can provide useful information on the
transcriptional regulation of protein complexes. As a first
step, we discuss several aspects of the prediction results for
patterns and putative regulons obtained for the annotated
complexes, summarized in Table 2. This table lists the results for all
the complexes for which at least one statistically significant
(Sig ≥ 1) regulatory pattern was detected. A complete list
of the predicted co-regulated components in each of the
complexes considered is given in Additional data file 2 (Table S4).
Table 2 reveals a clear difference between results for the
permanent and the non-permanent complexes. Most strikingly,
the fraction of the components of a given complex covered by
our putative regulons is noticeably higher for most
permanent complexes (0.7-1.0) than for the non-permanent ones
(0.06-0.6). The number of significant regulatory patterns and
the significance value of the 'best' pattern are also generally
higher in these complexes. Among the complexes with the
highest coverage by putative regulons and a large number of
statistically significant patterns we find the proteasome, the
large and small subunits of the cytoplasmic ribosome, three
complexes of the respiratory chain, the translation elongation
complex, as well as the nucleosomal protein and cyclin
Cdc28p complexes. To illustrate the information provided by
our approach, we will discuss in detail our findings for the
nucleosomal protein complex and the replication fork
complexes.
Nucleosomal protein complex
This complex has all of its eight components predicted to be
part of a regulon, with a large number (20) of significant
patterns. Details of the patterns discovered, of which the most
statistically significant are spaced dyads, and their locations
in the upstream regions of the corresponding genes are
shown in the feature map (Figure 7). All but one of these
dyads are mutually overlapping, and can be aligned to form
the larger motif cGCGAan{5}caGAACg, where upper-case letters
denote the most conserved residues, which seem to be the
'core' of the binding site, and the number in brackets is the
length of the spacer in terms of the number of intervening
nucleotides. The feature map shows that each upstream
sequence contains at least two occurrences of this 'core', with
some differences in the bases flanking this core. Although
several regulons - Hta1/Hta2, Hir2/Hir3/Hir4 and Spt10/
Spt21 - are known (and were found here) to map into this
complex, covering a total of seven out of the eight components
of the complex (Figures 1, 2), our findings represent the
first instance where a regulatory motif is proposed for all the
members of the nucleosomal complex.
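A simple way to look for occurrences of the 'core' of this motif in an upstream sequence is a regular-expression scan on both strands. The sketch below is a simplification and not the pattern-discovery procedure used in this work: it requires exact matches of the upper-case residues (GCGA and GAAC), treats the eight intervening positions (the lower-case residues and the 5-nucleotide spacer) as unconstrained, and is run on a made-up sequence; the function names are hypothetical.

```python
import re

# Core of cGCGAan{5}caGAACg: GCGA, 8 unconstrained positions, GAAC.
# The lookahead lets overlapping occurrences be reported as well.
CORE = re.compile(r"(?=(GCGA[ACGT]{8}GAAC))")
CORE_LENGTH = 16
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_core_sites(seq):
    """Return sorted (start, strand) pairs, with starts on the forward strand."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in CORE.finditer(seq)]
    rc = reverse_complement(seq)
    # Convert a start position on the reverse strand to forward coordinates.
    hits += [(len(seq) - m.start() - CORE_LENGTH, "-") for m in CORE.finditer(rc)]
    return sorted(hits)


# Made-up sequence containing one instance of the full motif.
toy_upstream = "TT" + "CGCGAATTTTTCAGAACG" + "AA"
sites = find_core_sites(toy_upstream)
```

Counting the hits per upstream region would give a rough check of the observation that each sequence carries at least two occurrences of the core, although a position-specific scoring scheme would be needed to account for the variation in the flanking bases.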
Replication fork complexes
The replication fork complex is an assembly of proteins
involved in DNA replication (Table 2). It is encoded by a total
of 30 genes, which can be subdivided into several smaller
complexes such as the DNA polymerase δ, DNA polymerase ε,
DNA α1 primase and replication factor C complexes. Analysis
of the entire assembly detected 12 patterns with a maximum
significance of 13.3, corresponding to an E-value of 2 × 10⁻¹³.
The discriminant analysis carried out on the basis of these
patterns allowed us to assign about half (17) of the 30 components
of this assembly to putative regulons (Table 2).
Table 3 lists the probabilities for individual components to be
assigned to the complex by the discriminant analysis. It
reveals a striking observation: the predicted co-regulated
genes correspond almost perfectly to seven out of the 14
individual complexes or entities that make up the assembly.
The 17 genes that belong to the putative regulons include
three of the four components of the DNA polymerase α1 primase
complex, all the components of the DNA polymerase δ and ε
complexes, the replication factor A and topoisomerase complexes,
as well as the proliferating cell nuclear antigen
(PCNA) and exonucleases. Furthermore, the majority of these
genes were assigned to the replication fork assembly with
high probability (0.8-0.99).
Interestingly, Jansen et al. [7] reported a poor correlation
with expression data for the replication complex, a large
com-
Figure 3 (see previous page)
Network graph of the statistically significant links between the TAP and HMS complexes and annotated regulons. Each node represents a complex (red
ellipse) or a regulon (blue rectangle). Individual complexes are identified by a number, prefixed by TAP [11] or HMS [12]. Regulons are denoted by the
name of the bound transcription factor. The number of genes in each group (complex or regulon) is given in parentheses. The number of genes common
to a given complex-regulon pair is indicated along the lines (arcs) joining the pair. (a-g) Seven dense clusters of connections. (h) The remaining very
small clusters, each involving mostly one or two connections, are grouped together. Clusters (a-c) are detailed in Figure 4.