superfamily using hidden Markov models
Yvonne Kallberg1,2, Udo Oppermann3 and Bengt Persson1,2,4
1 IFM Bioinformatics, Linköping University, Sweden
2 Department of Cell and Molecular Biology (CMB), Karolinska Institutet, Stockholm, Sweden
3 Structural Genomics Consortium, The Botnar Research Centre, NIHR Biomedical Research Unit, University of Oxford, UK
4 National Supercomputer Centre (NSC) and Swedish E-science Research Centre (SERC), Linköping University, Sweden
Introduction
The short-chain dehydrogenase/reductase (SDR) superfamily, recently reviewed in [1], consists of NAD(P)(H)-dependent oxidoreductases that are distinct from the medium-chain dehydrogenase and aldo–keto reductase (AKR) superfamilies. The term SDR was coined in 1991 [2], and the enzyme family has been shown to be present in all domains of life, from primitive bacteria to higher eukaryotes. Interestingly, about 25% of all identified dehydrogenases belong to the SDR superfamily [3]. Furthermore, in the ocean sequence sampling by Venter et al. [4], this superfamily was found to be the largest, with over 60 000 non-redundant sequences (over 30 000 of the ‘classical’ type and close to 30 000 of the ‘extended’ type).
The SDR superfamily currently has more than 47 000 primary structures available in sequence databases and over 300 crystal structures deposited in the Protein Data Bank. They show early divergence, the majority of family members having only low pairwise sequence identity (typically 20–30%), but have several properties in common, described in [1,2]. The three-dimensional structures are clearly homologous, with a single-domain globular Rossmann-related fold consisting of a β-sheet sandwiched between three α-helices on each side. The active site is formed by a triad/tetrad with highly conserved Tyr, Lys, Ser (and Asn) residues [1,5]. Substrate binding occurs in a cleft close to the coenzyme-binding site. This cleft shows considerable
Keywords
bioinformatics; classification; genomes;
hidden Markov model; short-chain
Correspondence
B. Persson, IFM Bioinformatics, Linköping
University, S-581 83 Linköping, Sweden
Fax: +46 13 137 568
Tel: +46 13 282 983
E-mail: bpn@ifm.liu.se
(Received 23 August 2009, revised 12
February 2010, accepted 16 March 2010)
doi:10.1111/j.1742-4658.2010.07656.x
The short-chain dehydrogenase/reductase (SDR) superfamily now has over 47 000 members, most of which are distantly related, with typically 20–30% residue identity in pairwise comparisons, making it difficult to obtain an overview of this superfamily. We have therefore developed a family classification system, based upon hidden Markov models (HMMs). To this end, we have identified 314 SDR families, encompassing about 31 900 members. In addition, about 9700 SDR forms belong to families with too few members at present to establish valid HMMs. In the human genome, we find 47 SDR families, corresponding to 82 genes. Thirteen families are present in all three domains (Eukaryota, Bacteria, and Archaea), and are hence expected to catalyze fundamental metabolic processes. The majority of these enzymes are of the ‘extended’ type, in agreement with earlier findings. About half of the SDR families are only found among bacteria, where the ‘classical’ SDR type is most prominent. The HMM-based classification is used as a basis for a sustainable and expandable nomenclature system.
Abbreviations
dehydrogenase.
variation between the individual SDR forms, explaining the wide substrate spectrum of this enzyme superfamily.
In humans, there are 202 SDR forms, corresponding
to at least 82 SDR genes; they have important
functions in steroid hormone, prostaglandin and retinoid
metabolism, and hence signaling. They also play
crucial roles in the metabolism of xenobiotics, including
drugs and carcinogens. A growing number of
single-nucleotide polymorphisms have been assigned to SDR
genes. Of the 77 human SDRs that are listed in the
well-annotated database Swiss-Prot, 24 enzymes are
associated with diseases in the OMIM database
(Online Mendelian Inheritance in Man). Thus, many
(or even all) of these enzymes are medically important.
However, the functions of about half of the human
SDR enzymes are unknown.
The SDR superfamily has grown by several orders
of magnitude, from 20-odd members in 1991 [2] to
over 47 000 today, and thanks to the fast progress in
genome and environmental sequencing projects, the
number of known SDR forms can be expected to
increase even more in the future. This substantiates the
need for a subdivision into families to achieve a
systematic overview and allow for annotations and for
functional conclusions. In this article, we apply hidden
Markov models (HMMs) to obtain a sequence-based
subdivision of the SDR superfamily that allows for
automatic classification of novel sequence data and
provides the basis for a nomenclature system.
Results and Discussion
The large SDR superfamily is of ancient origin, with
most members being equidistantly related at the 20–30%
residue identity level. Consequently, there are no
natural hierarchical relationships to rely on for the
functional assignments. HMMs have successfully been
used in protein family characterization [6], and their
use is now standard when new sequences are being
annotated. They are used in our functional
categorization of all SDR members, where each SDR family
corresponds to an HMM, and this set of resulting HMMs
forms the basis for a sustainable nomenclature scheme
for the whole SDR superfamily.
So far, 314 families have been defined, covering
about 31 900 of 47 000 retrieved SDRs.
Approximately 9700 SDRs form clusters that are too small
(fewer than 20 members with maximum 80% sequence
identity) for them to be reliably identified with an
HMM, but these will hopefully be extended as new
genomes become sequenced. The remaining SDRs
(about 6800) will be investigated henceforth.
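The size criterion above (at least 20 members with at most 80% sequence identity) can be illustrated with a minimal sketch. It assumes a greedy redundancy filter and a toy identity measure for pre-aligned, equal-length strings; neither is taken from the procedure used here, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence


def toy_identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length strings.

    Illustration only; a real analysis would use alignment-based identities.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def nonredundant_members(
    members: Sequence[str],
    identity: Callable[[str, str], float] = toy_identity,
    max_identity: float = 0.80,
) -> List[str]:
    """Greedy filter (an assumption): keep a sequence only if it is at most
    max_identity identical to every sequence already kept."""
    kept: List[str] = []
    for seq in members:
        if all(identity(seq, k) <= max_identity for k in kept):
            kept.append(seq)
    return kept


def large_enough_for_hmm(
    members: Sequence[str],
    identity: Callable[[str, str], float] = toy_identity,
    min_members: int = 20,
    max_identity: float = 0.80,
) -> bool:
    """True if the cluster keeps at least min_members after redundancy filtering."""
    return len(nonredundant_members(members, identity, max_identity)) >= min_members
```

With the default thresholds, a cluster whose members are nearly identical copies of one another would be counted as small, however many sequences it contains.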
Family size
The numbers of members in the different families identified vary considerably, but the majority of the families are quite small. Over half of the SDR families have fewer than 60 members, and 77% of the families have fewer than 100 members (Fig. 1). Large families are rare; there are only 16 families with 400 or more members (Table 1). They are primarily of the ‘extended’ SDR type (nine families), and several of these members metabolize carbohydrate derivatives, a basic function common to most life forms. Two such families are the GDP-mannose-4,6-dehydratases (SDR3E in the nomenclature system) and the GDP-L-fucose synthetases (SDR4E), which are involved in the two-step conversion of GDP-mannose to GDP-L-fucose. The latter is a substrate for several fucosyltransferases, which in turn are involved in the expression of many glycoconjugates [7,8]. Another example is provided by the UDP-glucose-4-epimerases (SDR1E), which catalyze the third and final step in the Leloir pathway of galactose metabolism, interconverting UDP-galactose and UDP-glucose. Impairment of this enzyme reaction leads to epimerase-deficiency galactosemia [9], which can lead to, for example, mental retardation or cataract in humans.
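As an illustration of how a family-size distribution of this kind can be tabulated, the following sketch bins family sizes into the size classes used in Fig. 1 (20–39, 40–59, 60–99, 100–199, 200–299, 300–499 and 500 or more). The function name is hypothetical, and the example sizes are merely the few family sizes quoted in the text, not the full dataset.

```python
from bisect import bisect_right
from collections import Counter
from typing import Iterable, List

# Lower bounds of the size classes shown in Fig. 1.
BIN_STARTS = [20, 40, 60, 100, 200, 300, 500]
BIN_LABELS = ["20-39", "40-59", "60-99", "100-199", "200-299", "300-499", "500-"]


def size_histogram(family_sizes: Iterable[int]) -> List[int]:
    """Count families per size class, in the order of BIN_LABELS."""
    counts: Counter = Counter()
    for size in family_sizes:
        index = bisect_right(BIN_STARTS, size) - 1
        if index < 0:
            raise ValueError(f"family size {size} is below the smallest class")
        counts[BIN_LABELS[index]] += 1
    return [counts[label] for label in BIN_LABELS]


# Family sizes quoted in the text: SDR56X, SDR61X, SDR109I, SDR53C, SDR7C, SDR8C.
example_sizes = [557, 245, 404, 38, 262, 360]
histogram = dict(zip(BIN_LABELS, size_histogram(example_sizes)))
```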
In 16th place among the largest families are the insect alcohol dehydrogenases (SDR109I). These form
a very specialized group of enzymes, also called the
‘intermediate’ SDR type [10]. This seems to be unique
to insects, and the size of this family is due to the very well-studied genomes of fruit flies. The 78 different
Fig. 1. SDR family sizes. The bars represent the number of SDR families of defined family sizes (size classes: 20–39, 40–59, 60–99, 100–199, 200–299, 300–499 and 500 or more members). The most common family size is between 20 and 39 members.
Drosophila genomes alone account for more than half
of the members (228 of 404).
Human SDR members
Of the 314 families identified, 37 have human
members. Ten additional human SDR families have been
identified, but these have too few members (< 20
with < 80% pairwise residue identity) to be suitable
for HMM analysis at present. In total, the 47
families represent 82 SDR genes (Table 2). In Table 3,
the numbers of genes for each SDR family in
human, mouse and rat are compared. For most
families, the numbers are identical for the three species.
Of the 13 families with two or more genes in
humans, nine have at least two genes also in mouse
and/or rat.
Regarding family size in relation to number of
human members, one would imagine a linear
correlation: the larger the size, the more human members.
However, typically, the SDR families with more than
400 members have only a single human representative
and, in total, as many as 34 of the 47 families have
only one human member. There are four families, with
retinol and steroid dehydrogenases, that stand out:
SDR7C, SDR11E and SDR16C, with their six human
members each, and SDR9C, with as many as eight
human members. This observation emphasizes the
critical importance of enzymatic control of retinoid
and steroid metabolism in development as well as
metabolic and homeostatic signaling [11]. In this
context, control of ligand access, by oxidoreductases such as the above-mentioned SDRs, to nuclear hormone receptors such as retinoid or steroid receptors appears to constitute an important determinant of steroid, retinoid and lipid signaling, and seems to necessitate the existence of such diversified enzyme forms to maintain and execute proper functions in multicellular eukaryotes and mammals.
Species distribution
A closer look at the distribution among the classified SDRs in the domains Eukaryota, Bacteria and Archaea (Fig. 2) reveals that more than half of the families are unique to bacteria (178 of 314). The two largest of these, SDR56X with 557 members and SDR61X with 245 members, have multidomain enzymes in the form of polyketide synthases. They typically have two NADP(H)-binding domains, and the two SDR families cover one domain each.
One-fifth (63) of the families have members among both bacteria and eukaryotes but not among archaeons. Archaeal SDRs are quite rare, only 32 families having any such member (Fig. 2). There are three families with much higher proportions of archaeal members than are generally found. Two of them comprise extended SDRs (SDR136E UDP-glucose-4-epimerase and SDR144E UDP-glucose homolog), and the third family contains a classical SDR (SDR146C 3-oxoacyl reductase). There is no single SDR family unique to archaeons.
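The tally of families per domain combination can be sketched as follows. This is an illustrative example with a hypothetical function name, and the toy input uses only the four families whose domain distribution is stated in the text. The final lines check that the counts quoted in this section are mutually consistent.

```python
from collections import Counter
from typing import Dict, Set


def domain_combinations(family_domains: Dict[str, Set[str]]) -> Counter:
    """Count families per combination of domains.

    Domains are coded E (Eukaryota), B (Bacteria) and A (Archaea); each
    combination is reported as a sorted string such as "BE" or "ABE".
    """
    return Counter("".join(sorted(domains)) for domains in family_domains.values())


# Toy input: domain sets as described in the text for these four families.
toy_families = {
    "SDR56X": {"B"},             # unique to bacteria
    "SDR61X": {"B"},             # unique to bacteria
    "SDR109I": {"E"},            # insect alcohol dehydrogenases
    "SDR53C": {"E", "B", "A"},   # members from all three domains
}
toy_counts = domain_combinations(toy_families)

# Consistency check of the reported numbers: of 314 families, 178 are
# bacteria-only, 63 are shared by bacteria and eukaryotes only, and 41 are
# eukaryote-only, which leaves the families that have archaeal members.
families_with_archaea = 314 - (178 + 63 + 41)
```

The remainder equals the 32 families reported to have archaeal members.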
Table 1. The 16 largest SDR families. The largest families, with currently more than 400 members, are listed. In the domain columns, the letters E, B and A denote eukaryotic, bacterial and archaeal genomes, respectively.
There are few families with representatives from all three domains, only 14 of 314 (Table 4). Seven of them have mammal (and human) representatives. Nine families are of the extended type, which is more than expected by chance, as the classical type is most common (about 69%). Therefore, it seems that families of the extended SDR type are represented in more species than the classical type, in agreement with early genome investigations [12]. Interestingly, there is one family (SDR53C, related to glucose dehydrogenases) with only 38 members that still has members from all domains, in spite of its small size. Typically, the bacterial members form the vast majority (80% or more) in most SDR families identified, in agreement with the fact that 79% of the SDRs are from that domain. However, in two of these 14 families, the eukaryotic members are in the majority (SDR51C, L-xylulose reductases; and SDR53C, glucose-1-dehydrogenase-related proteins).
There are 41 families with only eukaryotic members (Table 5). Around half of them are unique to one group of species; insect alcohol dehydrogenases constitute one such family, there are seven families unique to plants, and as many as 15 families are unique to fungi. Sixteen of the remaining families, with members from multiple groups of species, have mammalian (and human) representatives. These families include several of the steroid dehydrogenases/reductases and carbonyl and fatty acyl reductases.
SDR types
SDRs are divided into the types ‘classical’ and ‘extended’ [10], and it was previously noted that classical SDRs are more common; however, among SDRs that are present in all eukaryotes, the extended type is equally common [13]. Now we are able to make a large-scale comparison, including not only eukaryotes but also prokaryotes. Of the 314 families identified, there are 218 families judged to be classical and 52 extended (68% and 17%, respectively). In total, these cover about 27 900 proteins, and surprisingly, given that the majority of the families are classical, 36% of the proteins are of the extended type. This means that many of the largest families are of the extended type. Classical SDRs are in the vast majority in families with members from only one domain (Eukaryota or Bacteria) and also in families with both eukaryotic and bacterial members. When archaeal members are involved, however, the pattern changes considerably. Among the 14 families with members from all three domains, only five are classical; that is, the extended type is in the majority. One reason for this could be that the
extended SDRs are typically involved in basic
metabolic functions, and thus have a lower tendency to
vary. Classical SDRs, on the other hand, are involved
in many different types of enzyme reactions, and are
thus more diverse [1].
Family correlation with functional annotations
Among the identified families, only one-third of the proteins have informative annotations; the other two-thirds have terms such as putative, hypothetical, or
Table 3. SDR members in human, mouse, and rat.
possible/probable, or are only identified as SDR
proteins. One advantage of the present family grouping is
that functional annotations can be concluded from
other family members, as many families have at least a
few members with annotations describing their
functions. In order to investigate whether family functions
could be derived, annotations for each member were
compared within the families. In families where the
described functions are quite general, we find no
inconsistencies; that is, the annotations (if present) are
the same for every member in a family, thus
supporting the present classification.
We also find a good correlation between the present family classification and the function(s) among families for which a more detailed functional role is known, predominantly families with human and mammalian members. In families with a single human member (34 families), there are no contradictory functions annotated; that is, for all members with known function, the annotation is the same. There are some members that might have another function, according to the protein description, but the function seems to be derived rather than actually determined. For instance, in family SDR6E, there are two members described as GDP-mannose-4,6-dehydratase and 3β-hydroxysteroid dehydrogenase/isomerase (accession numbers Q00VJ3 and A0ZGH3), respectively, suggesting that they belong to families SDR3E and SDR11E instead, but there are no experimental data, and pairwise sequence comparisons clearly identify the human representative for SDR6E (UXS1_HUMAN) as the closest human ortholog.
The 13 families with multiple human members typically contain one or several 17β-hydroxysteroid dehydrogenases (17β-HSDs) (see, for example, [14,15] for overviews of functions). These types of enzyme have mixed origins; one of them (type 5) is not even an SDR, but belongs to the AKR family, and phylogenetic studies have shown that 17β-HSD activity has evolved from different ancestors, e.g. in types 1, 2 and 3 (corresponding to families SDR28C, SDR9C and SDR12C, respectively; see [16] and references therein). These studies also provide support for the inclusion of retinol dehydrogenases, an 11β-hydroxysteroid dehydrogenase and 17β-HSDs in the SDR9C family, as they most probably have a common ancestor among
Fig. 2. Number of SDR families with members representing one, two or three of the domains of life. The numbers represent the numbers of families with members in different combinations of the three domains Eukaryota (E), Bacteria (B) and Archaea (A). [Venn diagram values: E only, 41; B only, 178; E and B, 63; all three domains, 14; remaining regions, 0, 0 and 18.]
Table 4. The 14 SDR families present in all domains. The average ratio column shows the average number of members per species. The letters E, B and A denote eukaryotic, bacterial and archaeal genomes, respectively. The numbers represent percentage of members from each domain. Families with human occurrences are indicated by bold type in the eukaryotic column.
Table 5. SDR families unique to eukaryotes. The average ratio column shows the average number of members per species. M, Fi, I, P, Fu and O denote mammals, fish, insects, plants, fungi and other eukaryotes, respectively.
[Table columns: Family name; Family designation; Family size; Number of species; Average ratio; Species group. Family names surviving from the table body: NADP(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR12C); Multisubstrate SDR9C with preference for NAD(H); NADPH-dependent methylglyoxal reductase GRE2; NADH dehydrogenase (ubiquinone) 1α subcomplex 9; Fatty acid synthase α subunit FasA, 3-oxoacyl-(acyl-carrier protein) domain; NAD(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR30C); Putative short-chain type alcohol dehydrogenase.]
the invertebrates. Thus, the family classification seems
to be valid also for these families.
For every member in the first 47 families, we also
made sequence comparisons with all identified human
SDRs. In every family except two, all of the members
have the largest number of identities with the human
representatives in their own family. The first exception
is the retinol dehydrogenase family SDR7C, where we
find a total of 41 members (of 262) scoring higher
towards human SDR41C1 (WWOX_HUMAN) than
any of its own human representatives. During HMM
training, some overlaps were found between these
clusters, but as we were unable to create a single HMM
that captured every member in the two clusters, it was
decided to keep them separate for now. It is possible
that these two families have the same ancient origin,
and hence should be one family instead. The other
exception is SDR8C, comprising peroxisomal
multifunctional enzymes, where 25 members (of 360)
preferred SDR30C1 (DHB8_HUMAN). This is in spite of
the fact that HMM iteration with these proteins as
seed led to the inclusion of human SDR8C1 and not
SDR30C1. The SDR8C family is primarily involved in
fatty acid metabolism, and has a multidomain
structure, with an N-terminal SDR domain followed by a
hydratase domain, and finally a sterol carrier protein 2
domain. Members of the SDR30C family consist of a
single SDR domain; the exact function has yet to be
discovered, but both fatty acid and steroid metabolism
have been suggested. Thus, with the knowledge
available today, it is not possible to evaluate these families
further.
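The cross-check described above, in which each family member is compared with all human SDRs and its best-scoring human sequence is expected to lie in its own family, can be sketched as follows. The function name, the data layout and the identity counts are hypothetical; `HUMAN_SDR7C_REP` is a placeholder for a human SDR7C representative, whereas WWOX_HUMAN is the SDR41C1 entry named in the text.

```python
from typing import Dict


def members_preferring_other_family(
    member_family: Dict[str, str],
    human_family: Dict[str, str],
    identities: Dict[str, Dict[str, int]],
) -> Dict[str, str]:
    """Return members whose best-scoring human SDR lies in another family.

    identities maps each member to its number of identical residues against
    each human SDR; the result maps each deviating member to that human SDR.
    """
    deviating = {}
    for member, scores in identities.items():
        best_human = max(scores, key=scores.get)
        if human_family[best_human] != member_family[member]:
            deviating[member] = best_human
    return deviating


# Toy data with made-up identity counts (not taken from the study).
human_family = {"HUMAN_SDR7C_REP": "SDR7C", "WWOX_HUMAN": "SDR41C"}
member_family = {"member_a": "SDR7C", "member_b": "SDR7C"}
identities = {
    "member_a": {"HUMAN_SDR7C_REP": 120, "WWOX_HUMAN": 95},
    "member_b": {"HUMAN_SDR7C_REP": 88, "WWOX_HUMAN": 101},
}
deviating = members_preferring_other_family(member_family, human_family, identities)
```

In this toy case only `member_b` is reported, mirroring the situation in which some SDR7C members score higher towards WWOX_HUMAN than towards the human members of their own family.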
Other SDR classifications
As mentioned in Experimental procedures, Pfam [6]
identifies SDRs through the three profiles Adh_short
(PF00106), Epimerase (PF01370) and 3Beta_HSD
(PF01073), thus classifying these proteins at a much
more general level, which does not allow the more
fine-grained analysis regarding the presently identified
families Identifying members of the SDR superfamily
is, of course, a necessary step in order to be able to
cluster and divide them further, but does not give us
insights into the function of a specific protein, owing
to the large variation in functionality among SDRs
Also, the general HMMs might not correctly identify
sequence fragments, which more specialized HMMs
can. About 1600 SDRs were identified in this way, i.e.
not found by the general SDR HMMs but by the
family HMMs only.
Another approach uses evolutionary trees [17] to
achieve subfamily classification. In comparison with
the method presented herein, this approach arrives at much more fine-grained families; for example, our 3β-hydroxysteroid dehydrogenase family is split into eight subfamilies, and our family with retinol dehydrogenases is split into as many as 19 subfamilies. A classification system that is too specific would be impractical, as it would not provide a correct overview of the divergent SDR superfamily. Furthermore, functional conclusions drawn from family members would be of less practical value with smaller families, owing to limitations in annotations.
Our HMM system as basis for nomenclature
The presently characterized SDR families form a natural foundation for a nomenclature system. We have therefore, together with a number of researchers in the SDR field, created such a nomenclature system [18], which is already in use [19]. This nomenclature will help us to keep track of the different SDR families, and facilitate collection of knowledge on the structural and functional properties of one of the largest protein families known to date.
Experimental procedures
A number of HMMs were developed in order to arrive at a subclassification of the SDR superfamily. There are already HMMs (three Pfam HMMs and an HMM previously developed by us) for the identification of new SDR members in general. The purpose of the HMMs now developed is to divide the SDRs into more manageable subfamilies with a more specialized function in common.
HMMs were developed using an iterative approach to arrive at stable HMMs that correctly identify their own members and disregard members of other SDR families (see below).
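The general idea of iterating until a model recovers a stable set of members can be sketched as below. This is not the procedure used in this study: the model-building and scoring steps are passed in as functions, and the toy versions shown (a column-wise consensus and a count of matching positions) merely stand in for HMM building and HMM searching. All names and the threshold are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, FrozenSet, Iterable


def iterate_until_stable(
    seed: Iterable[str],
    candidates: Iterable[str],
    build_model: Callable[[FrozenSet[str]], str],
    score: Callable[[str, str], float],
    threshold: float,
    max_rounds: int = 20,
) -> FrozenSet[str]:
    """Rebuild the model from the current members, rescore all candidates,
    and stop when the member set no longer changes (or after max_rounds)."""
    candidates = list(candidates)
    members = frozenset(seed)
    for _ in range(max_rounds):
        model = build_model(members)
        new_members = frozenset(c for c in candidates if score(model, c) >= threshold)
        if new_members == members:
            break
        members = new_members
    return members


def toy_consensus(members: FrozenSet[str]) -> str:
    """Toy stand-in for model building: most common residue per column of
    pre-aligned sequences; ties go to the alphabetically later residue."""
    consensus = []
    for column in zip(*sorted(members)):
        counts = Counter(column)
        consensus.append(max(counts, key=lambda r: (counts[r], r)))
    return "".join(consensus)


def toy_score(model: str, sequence: str) -> int:
    """Toy stand-in for model scoring: number of positions matching the consensus."""
    return sum(a == b for a, b in zip(model, sequence))


family = iterate_until_stable(
    seed={"ACDE"},
    candidates=["ACDE", "ACDF", "ACGF", "WWWW"],
    build_model=toy_consensus,
    score=toy_score,
    threshold=3,
)
```

In this toy run the member set grows over two rounds and then stops changing, while the unrelated sequence is never recruited.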
SDR proteins were extracted from the UniProt database [20], human RefSeq [21] and human Ensembl [22] as of 1 October 2008, using a previously developed HMM [10] and the Pfam [6] profiles PF00106, PF01073 and PF01370. This dataset consisted of 47 011 proteins (7905 identified only by our own method, and 6254 only by Pfam). In addition, 1581 proteins have so far been identified by the HMMs now developed.
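The breakdown of the dataset by detection method amounts to simple set arithmetic, sketched below with a hypothetical function name and toy accession labels. The last line applies the same arithmetic to the reported totals to obtain the number of proteins found by both methods, a figure that is implied by, but not stated in, the text.

```python
from typing import Dict, Iterable


def split_hits(own_hits: Iterable[str], pfam_hits: Iterable[str]) -> Dict[str, int]:
    """Summarize which proteins were found by our own HMM, by Pfam, or by both."""
    own, pfam = set(own_hits), set(pfam_hits)
    return {
        "total": len(own | pfam),
        "only_own": len(own - pfam),
        "only_pfam": len(pfam - own),
        "both": len(own & pfam),
    }


# Toy accession labels (hypothetical), for illustration only.
summary = split_hits({"P1", "P2", "P3"}, {"P2", "P3", "P4", "P5"})

# Applied to the reported totals: 47 011 proteins in all, of which 7905 were
# found only by our own method and 6254 only by Pfam.
found_by_both = 47_011 - 7_905 - 6_254  # 32852
```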
In order to identify clusters of SDR families, each of the candidate sequences was compared with all of the other candidates using FASTA [23]. We tested clustering at various levels, and found that an initial clustering at the 40% level and an opt-score better than 700 were most appropriate, as judged from test cases with SDR enzymes of known function. Furthermore, the 40% level has also been shown to be suitable for other classification (nomenclature) systems,