Classification of the short-chain dehydrogenase/reductase superfamily using hidden Markov models


Yvonne Kallberg1,2, Udo Oppermann3 and Bengt Persson1,2,4

1 IFM Bioinformatics, Linköping University, Sweden

2 Department of Cell and Molecular Biology (CMB), Karolinska Institutet, Stockholm, Sweden

3 Structural Genomics Consortium, The Botnar Research Centre, NIHR Biomedical Research Unit, University of Oxford, UK

4 National Supercomputer Centre (NSC) and Swedish E-science Research Centre (SERC), Linköping University, Sweden

Keywords

bioinformatics; classification; genomes; hidden Markov model; short-chain

Correspondence

B. Persson, IFM Bioinformatics, Linköping University, S-581 83 Linköping, Sweden

Fax: +46 13 137 568

Tel: +46 13 282 983

E-mail: bpn@ifm.liu.se

(Received 23 August 2009, revised 12 February 2010, accepted 16 March 2010)

doi:10.1111/j.1742-4658.2010.07656.x

The short-chain dehydrogenase/reductase (SDR) superfamily now has over 47 000 members, most of which are distantly related, with typically 20–30% residue identity in pairwise comparisons, making it difficult to obtain an overview of this superfamily. We have therefore developed a family classification system, based upon hidden Markov models (HMMs). To this end, we have identified 314 SDR families, encompassing about 31 900 members. In addition, about 9700 SDR forms belong to families with too few members at present to establish valid HMMs. In the human genome, we find 47 SDR families, corresponding to 82 genes. Thirteen families are present in all three domains (Eukaryota, Bacteria, and Archaea), and are hence expected to catalyze fundamental metabolic processes. The majority of these enzymes are of the ‘extended’ type, in agreement with earlier findings. About half of the SDR families are only found among bacteria, where the ‘classical’ SDR type is most prominent. The HMM-based classification is used as a basis for a sustainable and expandable nomenclature system.

Abbreviations

dehydrogenase.

Introduction

The short-chain dehydrogenase/reductase (SDR) superfamily, recently reviewed in [1], consists of NAD(P)(H)-dependent oxidoreductases that are distinct from the medium-chain dehydrogenase and aldo–keto reductase (AKR) superfamilies. The term SDR was coined in 1991 [2], and the enzyme family has been shown to be present in all domains of life, from primitive bacteria to higher eukaryotes. Interestingly, about 25% of all identified dehydrogenases belong to the SDR superfamily [3]. Furthermore, in the ocean sequence sampling by Venter et al. [4], this superfamily was found to be the largest, with over 60 000 non-redundant sequences (over 30 000 of the ‘classical’ type and close to 30 000 of the ‘extended’ type).

The SDR superfamily currently has more than 47 000 primary structures available in sequence databases and over 300 crystal structures deposited in the Protein Data Bank. They show early divergence, the majority of family members having only low pairwise sequence identity (typically 20–30%), but have several properties in common, described in [1,2]. The three-dimensional structures are clearly homologous with a single-domain globular Rossmann-related fold consisting of a β-sheet sandwiched between three α-helices on each side. The active site is formed by a triad/tetrad with highly conserved Tyr, Lys, Ser (and Asn) residues [1,5]. Substrate binding occurs in a cleft close to the coenzyme-binding site. This cleft shows considerable variation between the individual SDR forms, explaining the wide substrate spectrum of this enzyme superfamily.

In humans, there are 202 SDR forms, corresponding to at least 82 SDR genes; they have important functions in steroid hormone, prostaglandin and retinoid metabolism, and hence signaling. They also play crucial roles in the metabolism of xenobiotics, including drugs and carcinogens. A growing number of single-nucleotide polymorphisms have been assigned to SDR genes. Of the 77 human SDRs that are listed in the well-annotated database Swiss-Prot, 24 enzymes are associated with diseases in the OMIM database (Online Mendelian Inheritance in Man). Thus, many (or even all) of these enzymes are medically important. However, the functions of about half of the human SDR enzymes are unknown.

The SDR superfamily has grown by several orders of magnitude, from 20-odd members in 1991 [2] to over 47 000 today, and thanks to the fast progress in genome and environmental sequencing projects, the number of known SDR forms can be expected to increase even more in the future. This substantiates the need for a subdivision into families to achieve a systematic overview and allow for annotations and for functional conclusions. In this article, we apply hidden Markov models (HMMs) to obtain a sequence-based subdivision of the SDR superfamily that allows for automatic classification of novel sequence data and provides the basis for a nomenclature system.

Results and Discussion

The large SDR superfamily is of ancient origin, with most members being equidistantly related at the 20–30% residue identity level. Consequently, there are no natural hierarchical relationships to rely on for the functional assignments. HMMs have successfully been used in protein family characterization [6], and their use is now standard when new sequences are being annotated. They are used in our functional categorization of all SDR members, where each SDR family corresponds to an HMM, and this set of resulting HMMs forms the basis for a sustainable nomenclature scheme for the whole SDR superfamily.

So far, 314 families have been defined, covering about 31 900 of 47 000 retrieved SDRs. Approximately 9700 SDRs form clusters that are too small (fewer than 20 members with maximum 80% sequence identity) for them to be reliably identified with an HMM, but these will hopefully be extended as new genomes become sequenced. The remaining SDRs (approximately 6800) will be investigated henceforth.

Family size

The numbers of members in the different families identified vary considerably, but the majority of the families are quite small. Over half of the SDR families have fewer than 60 members, and 77% of the families have fewer than 100 members (Fig. 1). Large families are rare; there are only 16 families with 400 or more members (Table 1). They are primarily of the ‘extended’ SDR type (nine families), and several of these members metabolize carbohydrate derivatives, a basic function common to most life forms. Two such families are the GDP-mannose-4,6-dehydratases (SDR3E in the nomenclature system) and the GDP-L-fucose synthetases (SDR4E), which are involved in the two-step conversion of GDP-mannose to GDP-L-fucose. The latter is a substrate for several fucosyltransferases, which in turn are involved in the expression of many glycoconjugates [7,8]. Another example is provided by the UDP-glucose-4-epimerases (SDR1E), which catalyze the third and final step in the Leloir pathway of galactose metabolism, interconverting UDP-galactose and UDP-glucose. Impairment of this enzyme reaction leads to epimerase-deficiency galactosemia [9], which can lead to, for example, mental retardation or cataract in humans.
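The size distribution summarized above can be tabulated with a few lines of code. The following is a minimal sketch, not the authors' procedure: the bin edges follow the size classes shown in Fig. 1, while the function names `bin_family_sizes` and `fraction_below` and the example sizes are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Lower edges of the size classes used in Fig. 1 (20-39, 40-59, ..., 500 or more).
BIN_EDGES = [20, 40, 60, 100, 200, 300, 500]
BIN_LABELS = ["20-39", "40-59", "60-99", "100-199", "200-299", "300-499", "500+"]


def bin_family_sizes(sizes):
    """Count families per size class; sizes below 20 are rejected."""
    counts = {label: 0 for label in BIN_LABELS}
    for size in sizes:
        if size < BIN_EDGES[0]:
            raise ValueError(f"family size {size} is below the 20-member minimum")
        # bisect_right counts how many lower edges are <= size.
        counts[BIN_LABELS[bisect_right(BIN_EDGES, size) - 1]] += 1
    return counts


def fraction_below(sizes, threshold):
    """Fraction of families with fewer than `threshold` members."""
    return sum(1 for s in sizes if s < threshold) / len(sizes)


# Hypothetical family sizes, for illustration only (not data from the paper).
example_sizes = [22, 35, 38, 41, 59, 60, 99, 150, 404, 557]
histogram = bin_family_sizes(example_sizes)
share_under_60 = fraction_below(example_sizes, 60)
```

Applied to the real list of family sizes, `fraction_below` would give the kind of summary figures quoted in the text (share of families below 60 or 100 members).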

In 16th place among the largest families are the insect alcohol dehydrogenases (SDR109I). These form a very specialized group of enzymes, also called the ‘intermediate’ SDR type [10]. This seems to be unique to insects, and the size of this family is due to the very well-studied genomes of fruit flies. The 78 different Drosophila genomes alone account for more than half of the members (228 of 404).

Fig. 1. SDR family sizes. The bars represent the number of SDR families of defined family sizes. The most common family size is between 20 and 39 members. Size classes (members per family): 20–39, 40–59, 60–99, 100–199, 200–299, 300–499 and 500 or more; the family-count axis runs from 0 to 120.

Human SDR members

Of the 314 families identified, 37 have human members. Ten additional human SDR families have been identified, but these have too few members (< 20 with < 80% pairwise residue identity) to be suitable for HMM analysis at present. In total, the 47 families represent 82 SDR genes (Table 2). In Table 3, the numbers of genes for each SDR family in human, mouse and rat are compared. For most families, the numbers are identical for the three species. Of the 13 families with two or more genes in humans, nine have at least two genes also in mouse and/or rat.

Regarding family size in relation to number of human members, one would imagine a linear correlation: the larger the size, the more human members. However, typically, the SDR families with more than 400 members have only a single human representative and, in total, as many as 34 of the 47 families have only one human member. There are four families, with retinol and steroid dehydrogenases, that stand out: SDR7C, SDR11E and SDR16C, with their six human members each, and SDR9C, with as many as eight human members. This observation emphasizes the critical importance of enzymatic control of retinoid and steroid metabolism in development as well as metabolic and homeostatic signaling [11]. In this context, control of ligand access, by oxidoreductases such as the above-mentioned SDRs, to nuclear hormone receptors such as retinoid or steroid receptors appears to constitute an important determinant of steroid, retinoid and lipid signaling, and seems to necessitate the existence of such diversified enzyme forms to maintain and execute proper functions in multicellular eukaryotes and mammals.

Species distribution

A closer look at the distribution among the classified SDRs in the domains Eukaryota, Bacteria and Archaea (Fig. 2) reveals that more than half of the families are unique to bacteria (178 of 314). The two largest of these, SDR56X with 557 members and SDR61X with 245 members, have multidomain enzymes in the form of polyketide synthases. They typically have two NADP(H)-binding domains, and the two SDR families cover one domain each.

One-fifth (63) of the families have members among both bacteria and eukaryotes but not among archaeons. Archaeal SDRs are quite rare, only 32 families having any such member (Fig. 2). There are three families with much higher proportions of archaeal members than are generally found. Two of them comprise extended SDRs (SDR136E UDP-glucose-4-epimerase and SDR144E UDP-glucose homolog), and the third family contains a classical SDR (SDR146C 3-oxoacyl reductase). There is no single SDR family unique to archaeons.

Table 1. The 16 largest SDR families. The largest families, with currently more than 400 members, are listed. In the domain columns, the letters E, B and A denote eukaryotic, bacterial and archaeal genomes, respectively.

There are few families with representatives from all three domains, only 14 of 314 (Table 4). Seven of them have mammal (and human) representatives. Nine families are of the extended type, which is more than expected by chance, as the classical type is most common (approximately 69%). Therefore, it seems that families of the extended SDR type are represented in more species than the classical type, in agreement with early genome investigations [12]. Interestingly, there is one family (SDR53C, related to glucose dehydrogenases) with only 38 members that still has members from all domains, in spite of its small size. Typically, the bacterial members form the vast majority (80% or more) in most SDR families identified, in agreement with the fact that 79% of the SDRs are from that domain. However, in two of these 14 families, the eukaryotic members are in the majority (SDR51C, L-xylulose reductases; and SDR53C, glucose-1-dehydrogenase-related proteins).

There are 41 families with only eukaryotic members (Table 5). Around half of them are unique to one group of species; insect alcohol dehydrogenases constitute one such family, there are seven families unique to plants, and as many as 15 families are unique to fungi. Sixteen of the remaining families, with members from multiple groups of species, have mammalian (and human) representatives. These families include several of the steroid dehydrogenases/reductases and carbonyl and fatty acyl reductases.
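The domain counts quoted in this section can be checked against each other with simple arithmetic. The sketch below uses only numbers stated in the text; the variable names are ours, and the count of families shared between Archaea and exactly one other domain is derived rather than quoted.

```python
# Family counts quoted in the text.
TOTAL_FAMILIES = 314
BACTERIA_ONLY = 178            # unique to bacteria
EUKARYOTA_ONLY = 41            # only eukaryotic members
ARCHAEA_ONLY = 0               # no family is unique to archaea
BACTERIA_AND_EUKARYOTA = 63    # bacteria and eukaryotes, but no archaea
ALL_THREE = 14                 # present in all three domains
WITH_ARCHAEA = 32              # families with any archaeal member

# Derived: families shared between Archaea and exactly one other domain.
archaea_plus_one = WITH_ARCHAEA - ALL_THREE - ARCHAEA_ONLY

# Every family falls into exactly one domain combination, so the parts
# should add up to the total number of families.
partition_sum = (BACTERIA_ONLY + EUKARYOTA_ONLY + ARCHAEA_ONLY
                 + BACTERIA_AND_EUKARYOTA + ALL_THREE + archaea_plus_one)

bacteria_only_share = BACTERIA_ONLY / TOTAL_FAMILIES

print(f"Archaea plus one other domain: {archaea_plus_one}")
print(f"Sum over all domain combinations: {partition_sum}")
print(f"Share unique to bacteria: {bacteria_only_share:.1%}")
```

The derived count and the sum over all combinations agree with the stated total of 314 families, which suggests the figures quoted in the text are mutually consistent.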

SDR types

SDRs are divided into the types ‘classical’ and ‘extended’ [10], and it was previously noted that classical SDRs are more common; however, among SDRs that are present in all eukaryotes, the extended type is equally common [13]. Now we are able to make a large-scale comparison, including not only eukaryotes but also prokaryotes. Of the 314 families identified, there are 218 families judged to be classical and 52 extended (68% and 17%, respectively). In total, these cover about 27 900 proteins, and surprisingly, given that the majority of the families are classical, 36% of the proteins are of the extended type. This means that many of the largest families are of the extended type. Classical SDRs are in the vast majority in families with members from only one domain (Eukaryota or Bacteria) and also in families with both eukaryotic and bacterial members. When archaeal members are involved, however, the pattern changes considerably. Among the 14 families with members from all three domains, only five are classical; that is, the extended type is in the majority. One reason for this could be that the extended SDRs are typically involved in basic metabolic functions, and thus have a lower tendency to vary. Classical SDRs, on the other hand, are involved in many different types of enzyme reactions, and are thus more diverse [1].

Family correlation with functional annotations

Among the identified families, only one-third of the proteins have informative annotations; the other two-thirds have terms such as putative, hypothetical, or possible/probable, or are only identified as SDR proteins. One advantage of the present family grouping is that functional annotations can be concluded from other family members, as many families have at least a few members with annotations describing their functions. In order to investigate whether family functions could be derived, annotations for each member were compared within the families. In families where the described functions are quite general, we find no inconsistencies; that is, the annotations (if present) are the same for every member in a family, thus supporting the present classification.

Table 3. SDR members in human, mouse, and rat.

We also find a good correlation between the present family classification and the function(s) among families for which a more detailed functional role is known, predominantly families with human and mammalian members. In families with a single human member (34 families), there are no contradictory functions annotated; that is, for all members with known function, the annotation is the same. There are some members that might have another function, according to the protein description, but the function seems to be derived rather than actually determined. For instance, in family SDR6E, there are two members described as GDP-mannose-4,6-dehydratase and 3β-hydroxysteroid dehydrogenase/isomerase (accession numbers Q00VJ3 and A0ZGH3), respectively, suggesting that they belong to families SDR3E and SDR11E instead, but there are no experimental data, and pairwise sequence comparisons clearly identify the human representative for SDR6E (UXS1_HUMAN) as the closest human ortholog.

The 13 families with multiple human members typically contain one or several 17β-hydroxysteroid dehydrogenases (17β-HSDs) (see, for example, [14,15] for overviews of functions). These types of enzyme have mixed origins; one of them (type 5) is not even an SDR, but belongs to the AKR family, and phylogenetic studies have shown that 17β-HSD activity has evolved from different ancestors, e.g. in types 1, 2 and 3 (corresponding to families SDR28C, SDR9C and SDR12C, respectively; see [16] and references therein). These studies also provide support for the inclusion of retinol dehydrogenases, an 11β-hydroxysteroid dehydrogenase and 17β-HSDs in the SDR9C family, as they most probably have a common ancestor among the invertebrates. Thus, the family classification seems to be valid also for these families.

Fig. 2. Number of SDR families with members representing one, two or three of the domains of life. The numbers represent the numbers of families with members in different combinations of the three domains Eukaryota (E), Bacteria (B) and Archaea (A). Counts shown in the diagram: 41, 178, 0, 63, 0, 18 and 14.

Table 4. The 14 SDR families present in all domains. The average ratio column shows the average number of members per species. The letters E, B and A denote eukaryotic, bacterial and archaeal genomes, respectively. The numbers represent percentage of members from each domain. Families with human occurrences are indicated by bold type in the eukaryotic column.

Table 5. SDR families unique to eukaryotes. The average ratio column shows the average number of members per species. M, Fi, I, P, Fu and O denote mammals, fish, insects, plants, fungi and other eukaryotes, respectively. Columns: family name, family designation, family size, number of species, average ratio and species group. Family names include NADP(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR12C); multisubstrate SDR9C with preference for NAD(H); NADPH-dependent methylglyoxal reductase GRE2; NADH dehydrogenase (ubiquinone) 1α subcomplex 9; fatty acid synthase α subunit FasA, 3-oxoacyl-(acyl-carrier protein) domain; NAD(H)-dependent 17β-hydroxysteroid dehydrogenase (SDR30C); and putative short-chain type alcohol dehydrogenase. Designations include SDR130C and SDR132C.

For every member in the first 47 families, we also made sequence comparisons with all identified human SDRs. In every family except two, all of the members have the largest number of identities with the human representatives in their own family. The first exception is the retinol dehydrogenase family SDR7C, where we find a total of 41 members (of 262) scoring higher towards human SDR41C1 (WWOX_HUMAN) than any of its own human representatives. During HMM training, some overlaps were found between these clusters, but as we were unable to create a single HMM that captured every member in the two clusters, it was decided to keep them separate for now. It is possible that these two families have the same ancient origin, and hence should be one family instead. The other exception is SDR8C, comprising peroxisomal multifunctional enzymes, where 25 members (of 360) preferred SDR30C1 (DHB8_HUMAN). This is in spite of the fact that HMM iteration with these proteins as seed led to the inclusion of human SDR8C1 and not SDR30C1. The SDR8C family is primarily involved in fatty acid metabolism, and has a multidomain structure, with an N-terminal SDR domain followed by a hydratase domain, and finally a sterol carrier protein 2 domain. Members of the SDR30C family consist of a single SDR domain; the exact function has yet to be discovered, but both fatty acid and steroid metabolism have been suggested. Thus, with the knowledge available today, it is not possible to evaluate these families further.
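The consistency check described above (does each family member score best against a human SDR from its own family?) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the dictionary layout and the toy identity values are hypothetical, and the pairwise identity counts are assumed to have been computed beforehand by an alignment tool.

```python
def find_misassigned(members, human_family, identities):
    """Return members whose best-scoring human SDR lies in another family.

    members:      dict mapping member ID -> family designation
    human_family: dict mapping human SDR ID -> family designation
    identities:   dict mapping (member ID, human SDR ID) -> identity count
    """
    misassigned = {}
    for member, family in members.items():
        # Collect this member's scores against every human SDR.
        scores = {human: identities[(member, human)]
                  for human in human_family if (member, human) in identities}
        if not scores:
            continue  # no comparison available for this member
        best_human = max(scores, key=scores.get)
        if human_family[best_human] != family:
            misassigned[member] = best_human
    return misassigned


# Toy data (hypothetical IDs and values, for illustration only).
members = {"memberA": "SDR7C", "memberB": "SDR7C"}
human_family = {"human7C": "SDR7C", "WWOX_HUMAN": "SDR41C"}
identities = {
    ("memberA", "human7C"): 120, ("memberA", "WWOX_HUMAN"): 95,
    ("memberB", "human7C"): 88, ("memberB", "WWOX_HUMAN"): 101,
}
flagged = find_misassigned(members, human_family, identities)
```

In this toy case only `memberB` is flagged, mirroring the situation reported for part of the SDR7C family; ties between human SDRs are not treated specially in this sketch.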

Other SDR classifications

As mentioned in Experimental procedures, Pfam [6] identifies SDRs through the three profiles Adh_short (PF00106), Epimerase (PF01370) and 3Beta_HSD (PF01073), thus classifying these proteins at a much more general level, which does not allow the more fine-grained analysis regarding the presently identified families. Identifying members of the SDR superfamily is, of course, a necessary step in order to be able to cluster and divide them further, but does not give us insights into the function of a specific protein, owing to the large variation in functionality among SDRs. Also, the general HMMs might not correctly identify sequence fragments, which more specialized HMMs can. About 1600 SDRs were identified in this way, i.e. not found by the general SDR HMMs but by the family HMMs only.

Another approach uses evolutionary trees [17] to achieve subfamily classification. In comparison with the method presented herein, this approach arrives at much more fine-grained families; for example, our 3β-hydroxysteroid dehydrogenase family is split into eight subfamilies, and our family with retinol dehydrogenases is split into as many as 19 subfamilies. A classification system that is too specific would be impractical, as it would not provide a correct overview of the divergent SDR superfamily. Furthermore, functional conclusions drawn from family members would be of less practical value with smaller families, owing to limitations in annotations.

Our HMM system as basis for nomenclature

The presently characterized SDR families form a natural foundation for a nomenclature system. We have therefore, together with a number of researchers in the SDR field, created such a nomenclature system [18], which is already in use [19]. This nomenclature will help us to keep track of the different SDR families, and facilitate collection of knowledge on the structural and functional properties of one of the largest protein families known to date.

Experimental procedures

A number of HMMs were developed in order to arrive at a subclassification of the SDR superfamily. There are already HMMs (three Pfam HMMs and an HMM previously developed by us) for the identification of new SDR members in general. The purpose of the HMMs now developed is to divide the SDRs into more manageable subfamilies with a more specialized function in common.

HMMs were developed using an iterative approach to arrive at stable HMMs that correctly identify their own members and disregard members of other SDR families (see below).
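The general idea of iterating until a model is stable can be illustrated with a small fixed-point loop. This is only a sketch of the concept and makes no claim about the authors' actual procedure or stopping criteria: `build_model` and `search` are hypothetical callables standing in for HMM construction and database search, and the toy stand-ins below use plain numbers instead of sequences.

```python
def iterate_until_stable(seed, database, build_model, search, max_rounds=20):
    """Rebuild a model from its own hits until the member set stops changing.

    build_model(members) -> model          (hypothetical callable)
    search(model, database) -> set of IDs  (hypothetical callable)
    """
    members = set(seed)
    for _ in range(max_rounds):
        model = build_model(members)
        hits = set(search(model, database))
        if not hits:
            return members, False   # model lost all members; give up
        if hits == members:
            return members, True    # stable: model recovers exactly its members
        members = hits
    return members, False           # did not converge within max_rounds


# Toy stand-ins: the "model" is the mean of the member values, and an entry
# is a hit when its value lies within 10 of that mean (illustration only).
toy_db = {"s1": 50, "s2": 55, "s3": 62, "s4": 90}


def toy_build(members):
    return sum(toy_db[m] for m in members) / len(members)


def toy_search(model, database):
    return {name for name, value in database.items() if abs(value - model) <= 10}


family, converged = iterate_until_stable({"s1"}, toy_db, toy_build, toy_search)
```

Starting from the single seed `s1`, the toy loop grows the set to `s1`, `s2` and `s3` and then stops, because a further round returns the same members while `s4` stays excluded.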

SDR proteins were extracted from the UniProt database [20], human RefSeq [21] and human Ensembl [22] as of 1 October 2008, using a previously developed HMM [10] and the Pfam [6] profiles PF00106, PF01073 and PF01370. This dataset consisted of 47 011 proteins (7905 only by our own method, and 6254 only by Pfam). In addition, 1581 proteins have so far been identified by the HMMs now developed.
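The dataset figures above imply how many proteins were detected by both general methods. The short calculation below is a sketch under two assumptions that the text suggests but does not state outright: every protein in the dataset was found by at least one of the two methods, and the 1581 proteins found by the family HMMs are not part of the 47 011. The variable names are ours.

```python
TOTAL_RETRIEVED = 47_011      # proteins in the initial dataset
ONLY_OWN_HMM = 7_905          # found only by the previously developed HMM
ONLY_PFAM = 6_254             # found only by the Pfam profiles
NEW_FAMILY_HMM_HITS = 1_581   # additionally found by the family HMMs

# Assumption: each protein was found by at least one method, so whatever is
# not exclusive to one method was found by both.
found_by_both = TOTAL_RETRIEVED - ONLY_OWN_HMM - ONLY_PFAM

# Assumption: the family-HMM hits are additional to the initial dataset.
total_with_family_hmms = TOTAL_RETRIEVED + NEW_FAMILY_HMM_HITS

print(f"Found by both general methods: {found_by_both}")
print(f"Total including family-HMM hits: {total_with_family_hmms}")
```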

In order to identify clusters of SDR families, each of the candidate sequences was compared with all of the other candidates using fasta [23]. We tested clustering at various levels, and found that an initial clustering at the 40% level and an opt-score better than 700 were most appropriate, as judged from test cases with SDR enzymes of known function. Furthermore, the 40% level has also been shown to be suitable for other classification (nomenclature) systems,
