Comparative study on sequence-structure-function relationship of human short-chain dehydrogenases/reductases


INTERNATIONAL UNIVERSITY VIETNAM NATIONAL UNIVERSITY HCMC

COMPARATIVE STUDY ON SEQUENCE-STRUCTURE-FUNCTION RELATIONSHIP OF

HUMAN SHORT-CHAIN DEHYDROGENASES/REDUCTASES

A thesis submitted to the School of Biotechnology, International University

in partial fulfillment of the requirements for the degree of

MSc in Biotechnology

Student name: TANG THI NGOC NU – MBT04011

Supervisor: Dr LE THI LY

May/2013


ABSTRACT

The human short-chain dehydrogenases/reductases (SDRs) family has been the subject of many recent studies due to its crucial roles in the human body. A growing number of single-nucleotide polymorphisms and a variety of heritable metabolic diseases have been identified from the SDR genome. Here, we carried out a phylogenetic analysis of homologous SDR sequences, and subsequently used a series of bioinformatics and comparative analytical methods to investigate the sequence-structure-function relationships within the human SDR family. Our findings show that Tyrosine, Serine, and Lysine are not only present in all members of the human SDR family, but are also located in a conserved region of both the SDR protein sequence and structure. In contrast, we find a cluster of three residues (Serine-Alanine-Serine, Phenylalanine-Glycine-Valine, Cysteine-Serine-Serine, Cysteine-Histidine-Serine or Alanine-Alanine-Alanine) that differ in protein sequence and structure and appear to be specific to each group of the human SDR family. Finally, our analysis of correlated mutations within the human SDR family reveals residues that are distantly located but seem to interact with one another. We hypothesize that these long-distance interactions may be an adaptive mechanism that allows members of the human SDR family to cope with a changing environment and differing functional demands over evolutionary time. Taken together, our results provide data that will be useful for designing inhibitors targeted at specific groups of human SDRs, such as those known to be involved in metabolic disorders.

Key words: multiple sequence alignments, consensus sequence, phylogeny, mutational variability and correlation


ACKNOWLEDGEMENTS

First and foremost, I would like to give special thanks to my university advisor, Dr. Ly Le, for her support and encouragement while I was carrying out the thesis, and for cheering me up and guiding me through temporary standstills. In addition, I am very grateful to Dr. Ly Le for providing me with this interesting topic and for straightening out many question marks concerning the bioinformatics part.

I would also like to thank my best friend, Charlene Mccord Buxan, for taking the time to read my Master's thesis and for sharing valuable comments.

Last but not least, I would like to express my deep gratitude to my parents, who are always by my side. Without my parents' support, I could not have successfully finished my Master's degree.


PUBLICATION

Ngoc Nu Tang, Ly Le. “Comparative Study on 11β Hydroxysteroid dehydrogenase 1 (11βHSD1)”. Research Journal of Biotechnology, 2012. Accepted.

Ngoc Nu Tang, Jacek Leluk, Ly Le. “Comparative study on Sequence-structure-function of Human Short-chain dehydrogenases reductase family”. BMC Bioinformatics, 2013. Submitted.

SUPERVISOR’S APPROVAL

Dr LE THI LY


THESIS CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

PUBLICATION

1 INTRODUCTION

1.1 General Introduction about Bioinformatics

1.2 General introduction on Human Short-chain dehydrogenases/reductases (SDR) family

1.3 Aims and Objectives

2 SEQUENCE DATABASES

2.1 Data Collection

2.2 Bioinformatic tools

3 SEQUENCE ANALYSIS TOOLS

3.1 Sequence alignment of human SDR protein

3.1.1 Alignment of Pair Sequence

3.1.2 Local and global alignment

3.1.3 Why Sequence Alignment is performed?

3.1.4 Substitution Matrices and Gap Penalties

3.1.5 Multiple Sequence Alignment

3.1.5.1 ClustalW

3.1.5.2 MUSCLE (Multiple Sequence Comparison by Log-Expectation)

3.1.5.3 KALIGN

3.1.5.4 T-COFFEE (Tree-based Consistency Objective Function of Alignment Evaluation)

3.1.5.5 GEISHA 3

3.1.6 Multiple sequence alignments of human SDR protein and alignment verification

3.2 Consensus sequence construction and BLAST search

3.2.1 What is BLAST (Basic Local Alignment Search Tool)?

3.2.2 Construction of consensus of Human SDR protein family and BLAST search

3.3 Phylogenetic tree construction and comparison of consensus sequences

3.3.1 Phylogenetic Tree Prediction

3.3.2 Distance-based Method

3.3.3 Character-based Method

3.3.3.1 PHYLIP

3.3.3.2 SSSSg

3.3.4 Human SDR phylogenetic tree and comparison of consensus sequences

3.4 Mutational variability of human SDRs

Mutational Variability (Talana, Consurf)

3.4.1 Consurf

3.4.2 Talana

3.4.3 Mutational variability of human SDR protein family

3.5 Analysis of correlated mutations

3.6 Availability of original software generated by authors

4 RESULTS AND DISCUSSION

4.1 Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity

4.2 Sequence specificity and interrelationships of the human SDR family

4.3 Mutational variability of human SDRs

4.4 Correlated mutations within the human SDR family

5 CONCLUSION

6 REFERENCES

7 SUPPLEMENTS


LIST OF FIGURES

Figure 1: In the motifs, “a” stands for aromatic residues, “c” for charged residues, “h” for hydrophobic residues, “L” for aliphatic residues, “p” for polar residues and “x” for any residue. In motif TGxxGhLG the aliphatic residue before the last G has replaced the original aromatic residue, and the last motif has been changed from h[KR]xxNGP into h[KR]xxNxxG

Figure 2: Illustration of a local and global alignment [Figure 2.2, [22]]

Figure 3: Here A, B and C represent three highly conserved sequences of the same protein taken from three separate organisms. The phylogenetic tree gives a view of the substitutions that happened during evolution, as these sequences evolved from the same ancestor [21]

Figure 4: Part of the result of the multiple sequence alignment of 71 human SDR members

Figure 5: Completed consensus sequence of 71 human SDR members

Figure 6: Phylogenetic tree construction by PHYLIP

Figure 7: Phylogenetic tree construction by SSSSg. Both programs show that the human SDR family can be phylogenetically grouped into five distinct classes

Figure 8: Comparison of the five consensus human SDR sequences

Figure 9: The active site (AS), substrate binding sites (BS), and three residues between the AS and one of the BS in 5 human SDR groups identified by Talana

Figure 10: The identification of functional regions within group 1 using Consurf and Talana

Figure 11: The result of mutational variability (done by Talana)

Figure 12: Variability profiles for each of the five groups of human SDRs

Figure 13: The location of the conserved and variable residues in the template structure of group 1 of human SDR, identified by Talana

LIST OF TABLES

Table 1: PDB code and name of five representative

Table 2: The core residues in five human SDR groups identified by Talana

Table 3: The surface residues in five human SDR groups identified by Talana

Table 4: The identification of correlated mutation sets and their core and surface characteristics for group 5

Table 5: Selected correlated mutations in human SDRs identified by Talana


1 INTRODUCTION

1.1 General Introduction about Bioinformatics

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied math, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications [1]. Bioinformatics arose in response to the need to handle the large quantities of biological data, which have increased dramatically [2]. For example, as of August 2000, the GenBank repository of nucleic acid sequences contained 8,214,000 entries [3] and the SWISS-PROT database of protein sequences contained 88,166 [4]. On average, these databases are doubling in size every 15 months [3]. Bioinformatics is often defined as the application of computational techniques to understand and organize the information associated with biological macromolecules. This unexpected union between the two subjects is largely attributed to the fact that life itself is an information technology; an organism’s physiology is largely determined by its genes, which at their most basic can be viewed as digital information [1].

The aims of bioinformatics are threefold.

The first aim of bioinformatics is to organize data in a way that allows researchers to access existing information and to submit new entries as they are produced, as in the Protein Data Bank for 3D macromolecular structures [5,6]. Organizing data is only a starting point, however; the purpose of bioinformatics extends much further.

The second aim of bioinformatics is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences. This requires more than a simple text-based search, and programs such as FASTA [7] and PSI-BLAST [8,9] must consider what comprises a biologically significant match. Development of such resources demands expertise in computational theory as well as a thorough understanding of biology.

The third aim of bioinformatics is to use these tools to analyze the data and interpret the results in a biologically meaningful manner. More specifically, bioinformatics can conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting novel features.

Bioinformatics makes an important contribution to biology, especially in analyzing huge amounts of biological data effectively. In this study, I applied the third aim of bioinformatics to highlight the general and specific characteristics of the human SDR family by covering two bioinformatics topics: multiple sequence alignment algorithms and the identification of conserved motifs.

1.2 General introduction on Human Short-chain dehydrogenases/reductases (SDR) family

Short-chain dehydrogenases/reductases (SDRs) belong to one of the largest enzyme super-families, which includes over 46,000 members [10]. Among these, at least 140 different enzymes have been sequenced to date, and about 70 of them are known to belong to the human SDR family [11, 12]. Most SDRs are known to be NAD- or NADP-dependent oxidoreductases that share characteristic sequence motifs and mechanisms of action [13, 14]. This SDR enzyme super-family is present in all forms of prokaryotic and eukaryotic life [13], and plays an important role in a variety of key metabolic processes.

Indeed, human SDRs have been extensively studied for their critical roles in lipid, amino acid, carbohydrate, cofactor, hormone and xenobiotic metabolism, as well as in redox sensor mechanisms [15]. In addition to their crucial roles in normal metabolic processes, the function of human SDRs in metabolic defects, such as type II diabetes, warrants continued research attention [16]. Given their part in proper physiological functioning, the human SDR protein family appears to be a suitable target for the development of novel drugs directed at influencing hormone metabolism [17].

Despite their importance to proper metabolic function and potential use for the treatment and prevention of various human diseases, a standardized way of classifying SDRs has yet to be established. According to prior studies, SDR enzymes can be divided into two main types, denoted as “Classical” and “Extended” [18,19]. The “Classical” type consists of about 250 amino acid residues, while the “Extended” family has an additional 100-residue domain forming the C-terminal region. Another study, alternatively, divided the family into three types, designated as “Intermediate”, “Complex” and “Divergent”, which can be distinguished according to their characteristic sequence motifs [15]. Even with conflicting ideas of how to group SDRs, it is clear that members of the human SDR family have diverged over evolutionary time because they share only 15 to 30% overall sequence identity [16]. Despite clear sequence diversification, human SDRs all have a common sequence motif that defines the cofactor binding site (TGxxxGxG) and the catalytic tetrad (N-S-Y-K) [20]. Moreover, the three-dimensional structures of all human SDRs share common features, such as an alpha/beta-folding motif characterized by a central beta-sheet. This central beta-sheet is typical of a Rossmann-fold with helices on either side [20]. Given these interesting structural similarities, it is important to study the evolutionary history of the SDR super-family to better understand why its members have similar 3D structures despite sharing very little sequence identity. It is proposed that these common motifs might be conserved through evolution due to their crucial function in differentiating the human SDR family from other enzyme families [13]. While bio-molecule mutations occur at the level of sequence, the effects of these mutations are noticed at the level of function. Bio-molecule function, in turn, is directly related to 3D structure. As such, by studying and comparing the sequences and 3D structures of the different human SDRs in a phylogenetic context, it may be possible to reveal more pertinent information about the evolutionary and functional diversification of the group.

The first enzymes of this type were analysed as early as the 1970s. These analyses gave the structures of prokaryotic ribitol dehydrogenase and Drosophila alcohol dehydrogenase. The proteins were not then known to be a family, but this alcohol dehydrogenase turned out to differ from the previously known alcohol dehydrogenases of liver and yeast. When other dehydrogenases showed the same distinctive pattern as the two alcohol dehydrogenase types, the concept of a family of short-chain dehydrogenases was established [20]. This occurred in 1981, and since then the SDR family has grown enormously, both in the number of known members and in the variety of their functions. Currently at least 3000 members, including species variants, are known, with a substrate spectrum ranging from alcohols, sugars, steroids and aromatic compounds to xenobiotics [19, 20].

As can be expected from its broad variety of functions, the SDR family is very divergent. The residue identity in pair-wise comparisons is as low as 15-30%. However, although few residues are completely conserved, there are several sequence motifs, or consensus patterns, that are distinguishable within the families. The criterion for SDR membership is therefore the occurrence of typical sequence motifs, arranged in a specific manner. These motifs comprise Rossmann-fold elements for nucleotide binding and specific residues for the active site, and they reflect common folding patterns [18].

The SDR enzymes can be divided into two main classes, the Classical and the Extended families. The Classical family is the larger of the two, with 218 of the sequences in the data set as opposed to 118 for the Extended family. Why, then, are there two classes, and what distinguishes them from each other? One distinction is the length of the sequences. The classical SDRs have a sequence length of around 250 residues, while the extended SDRs are around 350 residues long. Another difference, although with exceptions, is that the classical SDRs prefer the NADP(H) coenzyme and the extended SDRs prefer NAD(H). There are, however, NAD(H)-binding classical SDRs as well as NADP(H)-binding extended SDRs [19, 20]. The sequence motifs mentioned earlier also differ between the two main families. These motifs are the most distinguishing feature of the two classes, since other properties, such as length, can vary. The motifs can therefore be used to separate the two main families.

What, then, do the motifs look like? They are placed in or near different secondary structure elements; for example, the Gly-motif, TGxxxGhG or TGxxGhlG, is placed in and adjacent to β1 + α1. There are seven motifs each for the Classical and the Extended families. These motifs are based on the motifs used by Bengt Persson et al. [20] and can be seen in Figure 1 below:

Figure 1: In the motifs, “a” stands for aromatic residues, “c” for charged residues, “h” for hydrophobic residues, “L” for aliphatic residues, “p” for polar residues and “x” for any residue. In motif TGxxGhLG the aliphatic residue before the last G has replaced the original aromatic residue, and the last motif has been changed from h[KR]xxNGP into h[KR]xxNxxG.
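The motif notation described above maps naturally onto regular expressions. The following is a minimal Python sketch, not part of the original analysis, that translates a motif such as TGxxxGxG into a regular expression and scans a sequence with it. The residue sets assigned to each class, the function names, and the test sequence are assumptions made only for illustration; the sketch also assumes that lowercase letters denote residue classes (with lowercase "l" for aliphatic, as in TGxxGhlG) while uppercase letters are literal residues.

```python
import re

# Residue classes are ASSUMED for illustration; the thesis does not list them.
RESIDUE_CLASSES = {
    "a": "FWYH",       # aromatic (assumption)
    "c": "DEKRH",      # charged (assumption)
    "h": "AVILMFWYC",  # hydrophobic (assumption)
    "l": "AVIL",       # aliphatic (assumption)
    "p": "STNQCYH",    # polar (assumption)
}


def motif_to_regex(motif):
    """Translate motif notation (e.g. 'TGxxxGxG') into a regex string."""
    parts = []
    in_brackets = False
    for ch in motif:
        if ch == "[":
            in_brackets = True
            parts.append(ch)
        elif ch == "]":
            in_brackets = False
            parts.append(ch)
        elif in_brackets or ch.isupper():
            parts.append(ch)   # literal residue, or member of an explicit set
        elif ch == "x":
            parts.append(".")  # any residue
        else:
            parts.append("[" + RESIDUE_CLASSES[ch] + "]")
    return "".join(parts)


def find_motif(motif, sequence):
    """Return the 0-based start positions of non-overlapping motif matches."""
    pattern = re.compile(motif_to_regex(motif))
    return [m.start() for m in pattern.finditer(sequence)]


# Hypothetical sequence fragment, invented for this example.
example_sequence = "MKVALVTGASSGIGLA"
positions = find_motif("TGxxxGxG", example_sequence)
print(motif_to_regex("TGxxxGxG"), positions)
```

With different class definitions the same motif would match a different set of sequences, so the residue sets should be taken from the motif's original publication before any real use.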

It cannot be denied that the human SDR family has important implications for medicine, especially through its involvement in metabolic defects such as type II diabetes. Therefore, the identification and functional analysis of the human SDR family at the level of sequence and structure is the primary goal of this study, with a view to identifying new targets for drug design and development.

1.3 Aims and Objectives

In this study, a rigorous comparative analysis of homologous sequences and sequence-structure-function relationships in the human SDR family was performed using bioinformatics. Our goal was to gain insight into the mechanisms of action of the human SDR family. Specifically, we sought to identify and compare the convergent and divergent residues of the human SDR nucleotide-binding pocket.

We hypothesized that evolutionarily conserved regions in the human SDR family would appear at or near the location of the active and binding sites of the protein. This is because active and binding sites are responsible for the chemical and enzymatic reactions that happen in the protein molecules. Maintaining proper functioning at these sites is necessary for the protein-protein and protein-ligand interactions that are indispensable in regulating molecular processes. In contrast, we expected to find variable regions in the human SDR family near the nucleotide binding sites, due to the varying substrate-enzyme interactions that are characteristic of each individual group of the human SDR family. Through periods of adaptive radiation over evolutionary time, these regions of variability allowed each group within the human SDR family to adopt its own specific features. These nuances in structural design, and potential functionality, are important to identify in order to facilitate the future design of inhibitors that are directly targeted to each subgroup of human SDRs [18,19, 20].

Tools for multiple sequence alignment:

ClustalW, MUSCLE, Kalign and T-COFFEE were available at European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) (http://www.ebi.ac.uk/services/all)

Geisha3 was available at (http://atama.wnb.uz.zgora.pl/~jleluk/linki.html)

Tools for constructing consensus sequence:

(http://atama.wnb.uz.zgora.pl/~jleluk/linki.html)

Tools for constructing phylogenetic tree:

PHYLIP was available at (http://www.phylip.com)

SSSSg was available at (http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/ssssg/ssssg.zip)

Tools for studying mutational variability:

Consurf was available at (consurf.tau.ac.il)

Talana was available at (http://www.bioware.republika.pl/)

Tools for visualizing the results from Consurf and Talana:

Rastop was available at (http://www.geneinfinity.org/rastop)

Tools for studying correlated mutations:

Talana was available at (http://www.bioware.republika.pl/)

(http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/corm.jar)

3 SEQUENCE ANALYSIS TOOLS

3.1 Sequence alignment of human SDR protein

Several diseases are caused by disorders in genes or proteins. An understanding of the sequences, of how the genes or proteins are related to each other, and of what functions they have can help in the development of remedies for diseases caused by these disorders. However, finding the gene or protein that is responsible for a disease is an enormous task. This is because, for example, a single gene can be responsible for several hundred or several thousand base-pair combinations that could bear the disorder. There can also be hundreds of candidate gene sequences for intensive study. Hence, it is hard to choose a good candidate to investigate further as the possible origin of the disease. Appropriate candidates have previously been found by trial and error. However, the search for a remedy for a disease is time consuming; thus, new techniques and algorithms are needed to make it easier to discover the gene or protein that causes a particular disease [21].

Biological data are available on several websites, such as UniProtKB and NCBI, and several different techniques are available for comparing sequences. For example, to highlight the differences and similarities between two protein sequences, pair-wise sequence alignment is the best method. In contrast, to identify the similar and different characteristics of all members of a certain family, multiple sequence alignment is the suitable tool.

3.1.1 Alignment of Pair Sequence

A pair-wise sequence alignment is performed when two protein sequences are available in the databases (UniProtKB or NCBI); the two sequences are compared for a series of characters or character patterns that lie in the same order in both sequences.

For example, consider two sequences, V and W, written in a two-row matrix. The first row contains the characters of V and the second the characters of W. Matching characters in V and W are placed in the same column, and differing characters can be placed in the same column as a mismatch. Another way to deal with differing characters is to insert a gap into one sequence, so that the character is placed opposite a gap in the other sequence. The reason for introducing a gap is that more matches can be achieved in this way. The column score is usually positive for matching characters and negative for dissimilar characters. The sum of all column scores makes up the score for the alignment [22].
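The column-by-column scoring described above can be sketched in a few lines of Python. This is an illustrative sketch only: the scoring values (+1 for a match, -1 for a mismatch, -2 for a gap), the function name, and the example rows are assumptions, not values used in this study.

```python
def alignment_score(row_v, row_w, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of an alignment given as two equal-length rows.

    '-' marks a gap. The default scores are arbitrary illustrative values.
    """
    if len(row_v) != len(row_w):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(row_v, row_w):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            score += gap        # character opposite a gap
        elif a == b:
            score += match      # identical characters in the column
        else:
            score += mismatch   # different characters in the column
    return score


# Hypothetical two-row alignment of sequences V and W.
row_v = "HEAGAWGHE-E"
row_w = "--P-AW-HEAE"
print(alignment_score(row_v, row_w))
```

In the example rows there are five matching columns, one mismatch and five gap columns, so the total under these assumed scores is 5 - 1 - 10 = -6.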

3.1.2 Local and global alignment

A pair-wise alignment can be performed either as a global or as a local alignment, as shown in Figure 2 below.

The global method was the first to be invented and is used to compare sequences over their entire length (Figure 2). Global alignment works properly when the sequences are conserved or closely related to each other.

In contrast, the local method works well when the sequences are not closely related, so that they might share only a conserved region and are not similar over the whole sequence. In local alignment, only substrings of the sequences are aligned [22].

Therefore, the level of similarity between the sequences determines whether local or global alignment should be applied.
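The difference between the two modes can be illustrated with a small dynamic programming sketch in Python. It follows the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap cost; these algorithm names, the scoring values, and the example sequences are assumptions added for illustration and are not taken from the thesis.

```python
def align_score(v, w, local=False, match=1, mismatch=-1, gap=-2):
    """Return the optimal global (default) or local alignment score of v and w."""
    rows, cols = len(v) + 1, len(w) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = v[i - 1] == w[j - 1]
            diag = table[i - 1][j - 1] + (match if same else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local alignment: a substring alignment may restart anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    # Global: score of aligning both full sequences; local: best substring pair.
    return best if local else table[-1][-1]


# Two hypothetical sequences that share only the short region "ACG".
v, w = "TTTACGTTT", "GGACGGG"
print("global:", align_score(v, w))
print("local: ", align_score(v, w, local=True))
```

For sequences that share only a short conserved region, as in this example, the local score reflects that region alone, while the global score is dragged down by the unrelated flanks.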


Figure 2: Illustration of a local and global alignment [Figure 2.2, [22]]

3.1.3 Why Sequence Alignment is performed?

Sequence alignments are important for discovering functional, structural, and evolutionary information in biological sequences. With the help of an alignment, it is straightforward to show whether protein sequences share similar biochemical function and 3D structure. If proteins from different life forms are similar, they might have diverged from a common ancestor. In that case they are homologous, having undergone mutation and selection to evolve into new sequences. The alignment also indicates the changes that have occurred between the sequences and their common ancestor; these are considered substitutions. Insertions of new residues and deletions of old residues are represented as gaps [23]. Therefore, the more similar the sequences are, the fewer changes have occurred and the more likely the proteins are to be related; thus the best alignment is the one that best represents the most likely evolutionary scenario.

It is, however, important to remember that even though sequences are similar, they might not necessarily be homologous. Similarity between short sequence fragments may have arisen by chance or as a result of evolutionary convergence, meaning that the similar regions have the same function but have developed independently from different ancestors [24]. This remains a limitation of sequence alignment. To mitigate that problem, in this study I carried out five independent alignment approaches in order to increase the accuracy of the alignment [24].

3.1.4 Substitution Matrices and Gap Penalties

As mentioned earlier, substitutions happen during evolution. When this happens, certain amino acids are exchanged more commonly than others because they share similar physico-chemical properties; other changes take place too, but they are rarer. Knowing which substitutions are the most and least frequent in a large number of proteins can aid the prediction of alignments for any set of protein sequences [21]. Matrices that estimate the probabilities of all possible substitutions can therefore be of use here. There are several different methods for building these so-called substitution matrices, but the two most commonly used are PAM and BLOSUM. PAM is, for example, applied in the widely used global multiple alignment program CLUSTALW [22].

Aside from binary cost functions (0: match and 1: mismatch), a transformation matrix of substitution costs can be used, which assigns a separate penalty to each class of mismatch observed [23].


The minimum mutation distance matrix (Fitch, 1990) is based on the minimum number of nucleotides that must be changed in order to convert the codon for one amino acid into the codon for another amino acid. The most common type of transformation table is the log-odds matrix. These log-odds matrices contain the relative frequencies with which amino acids are assumed to replace one another over time. Positive values in the matrix indicate a replacement rate greater than expected by chance, whereas negative values indicate a replacement rate less than expected by chance.

The most relevant log-odds matrices are the PAM (point accepted mutation) matrices (Dayhoff et al., 1978). A PAM-X matrix is calculated from the original PAM1 matrix by multiplying PAM1 by itself X times, giving the substitution probabilities after X PAM1 mutation steps. Low PAM matrices are used with closely related sequences, while high PAM matrices are used with distantly related sequences. The other matrix is the BLOSUM matrix (Henikoff, 1992), which is based on well-conserved blocks of multiply aligned sequence segments, or motifs, that represent the most conserved regions of an aligned family.
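The extrapolation from PAM1 to a higher PAM matrix, and the conversion of probabilities into log-odds scores, can be sketched as follows. The 3-letter alphabet, the numbers in TOY_PAM1, the uniform background frequencies, and the choice of log base 2 are all invented for illustration; the real Dayhoff matrices cover 20 amino acids and use empirically derived values.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, x):
    """Multiply a matrix by itself x times (x = 0 gives the identity)."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, matrix)
    return result


def log_odds(prob, background):
    """score[i][j] = log2(P(i -> j) / background[j]), one common formulation."""
    n = len(prob)
    return [[math.log2(prob[i][j] / background[j]) for j in range(n)]
            for i in range(n)]


# TOY 3-letter "PAM1": each row holds the probabilities of a residue staying
# the same or changing after one step. All values are assumptions.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform residue frequencies

toy_pam250 = mat_power(TOY_PAM1, 250)
scores_pam1 = log_odds(TOY_PAM1, BACKGROUND)
scores_pam250 = log_odds(toy_pam250, BACKGROUND)
print(scores_pam1[0], scores_pam250[0])
```

As the exponent grows, the probability of a residue remaining unchanged falls and the scores for identities and substitutions move closer together, which is why high PAM matrices suit distantly related sequences.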

As explained earlier, an alignment may contain gaps. These gaps are introduced into the alignment in order to align as many identical characters as possible. There should not, however, be too many gaps, because if gaps appear everywhere the alignment will show an unlikely pattern of amino acid changes [24, 25]. For this reason there are penalties for inserting gaps: one penalty for opening a gap and one penalty for extending it. There are several ways to decide the values of the penalties, but the gap extension penalty is usually set to something less than the gap-open penalty, allowing long insertions and deletions to be penalized less than they would be by a linear gap cost. This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue [23].
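The effect of separate gap-open and gap-extension penalties can be seen in a short numerical sketch. The penalty values and function names below are assumptions chosen only for illustration, and the convention used (the first gap position costs the opening penalty, each further position costs the extension penalty) is one of several in use; some programs charge the extension penalty for every position including the first.

```python
def linear_gap_cost(length, per_residue=10.0):
    """Linear cost: every gap position costs the same amount."""
    return length * per_residue


def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Affine cost: opening penalty once, extension penalty afterwards."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


for gap_length in (1, 2, 5, 10):
    print(gap_length, linear_gap_cost(gap_length), affine_gap_cost(gap_length))
```

Under these assumed values a single-residue gap costs 10 in both schemes, but a five-residue gap costs 50 under the linear scheme and only 12 under the affine scheme, so one long insertion or deletion is preferred over many scattered gaps.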

In addition, gaps are introduced in the alignment to represent implied insertions or deletions. The decision to introduce a gap in the alignment is a result of the gap cost calculation during the wave-front update of the matrix elements.

Several methods are widely used for pair-wise sequence alignment. The first method is dot matrix analysis. This method displays a matrix of the two sequences: one sequence is written horizontally across the top of the graph and the other along the left-hand side, and diagonal lines indicate alignments [21].

The second method is the dynamic programming algorithm, which solves a problem by combining solutions to sub-problems. It finds the optimal alignment by comparing all character pairs in the sequences [24]. This algorithm is commonly used for sequence analysis.
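The dot matrix analysis described above can be sketched in Python as follows. This is a minimal illustration with no window or threshold filtering, which practical dot-plot tools usually add; the function names and example sequences are assumptions.

```python
def dot_matrix(seq_top, seq_side):
    """Return a matrix of booleans: True where the two residues are identical.

    Rows follow seq_side (written down the left-hand side) and columns
    follow seq_top (written across the top).
    """
    return [[a == b for b in seq_top] for a in seq_side]


def render(seq_top, seq_side):
    """Draw the dot matrix as text, with '*' for a dot and '.' for no dot."""
    lines = ["  " + " ".join(seq_top)]
    for residue, row in zip(seq_side, dot_matrix(seq_top, seq_side)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Hypothetical sequences sharing the region "GASSG".
print(render("TGASSGIG", "AGASSGL"))
```

Runs of '*' along a diagonal correspond to stretches that align without gaps; a shift from one diagonal to another suggests an insertion or deletion.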

The third method is the word or k-tuple method. This method starts by searching for identical short stretches of sequence, called words or k-tuples. These are then joined into an alignment with the dynamic programming algorithm. Examples of programs using this method are FASTA and BLAST. These programs are commonly used for database searches, when seeking the sequences that align best with an input test sequence [25].
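The first step of the word method, finding identical k-tuples shared by two sequences, can be sketched as below. This sketch covers only the word-lookup stage; the later joining and extension stages of FASTA and BLAST are not reproduced. The function names, the word length k=3, and the sequences are assumptions for illustration.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every word of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def shared_words(query, target, k=3):
    """Return (query_position, target_position) pairs of identical k-tuples."""
    target_index = index_words(target, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in target_index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


# Hypothetical query and target sequences.
hits = shared_words("TGASSG", "AATGASS", k=3)
print(hits)
```

Hits whose positions differ by the same offset lie on one diagonal of the dot matrix, and it is such runs of hits that a word-based program would go on to join into an alignment.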

These are a few of the most widely used methods for pair-wise sequence alignment. If more than two sequences are to be aligned, however, these methods are not a good choice. The next section, Multiple Sequence Alignment, discusses other methods that are more appropriate in this case.

3.1.5 Multiple Sequence Alignment

As in a pair-wise alignment, in a multiple sequence alignment the sequences are searched for a series of individual characters or character patterns that occur in the same order in all the sequences. The alignment can also be performed as a global or a local alignment, and substitution matrices and gap penalties can be used. The difference is that a multiple sequence alignment contains more than two sequences.

A multiple sequence alignment of a set of sequences can provide information on the most similar regions in the set. In proteins, such regions may represent conserved functional or structural domains.

Multiple alignments are the basis for the most sensitive sequence searching algorithms. They are also useful for deciphering evolutionary information in biological sequences. For example, they provide information on which residues are important for the function and for stabilizing the secondary and three-dimensional structures of the protein. That is, they can illustrate which residues and regions represent conserved functional or structural domains. Even if only two sequences in a set are to be aligned, it can be meaningful to conduct a multiple alignment of all sequences in the set in order to improve the accuracy of the alignment. In addition, it is difficult to identify the pattern of conserved residues when comparing only two sequences. Therefore, multiple sequence alignment is the most common method for studying the conserved motifs of a given family [26].

Standard protocol of traditional multiple sequence alignment (MSA):

• Pair-wise alignment (aligning the most closely related sequences first and then gradually adding the more distant ones)

• A distance matrix (k-tuple) is calculated from the pair-wise alignments (based on the level of identity), giving the divergence of each sequence

• A guide tree is calculated from the distance matrix by applying Neighbor Joining

• The sequences are progressively aligned according to the branching order in the guide tree (based on the substitution matrices, PAM/BLOSUM), producing the complete alignment [22, 23]
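The first two steps of this protocol, computing a distance matrix and choosing which sequences to align first, can be sketched as below. This is a simplified illustration and not the procedure of any particular program: the k-tuple distance used here (one minus the fraction of shared words) is an assumed stand-in for the distances computed by real MSA tools, full Neighbor Joining is not implemented, and only the closest pair, which would be aligned first, is identified. The sequence names and sequences are hypothetical.

```python
from itertools import combinations


def ktuple_set(sequence, k=2):
    """Return the set of distinct words of length k in the sequence."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than the word length k")
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def ktuple_distance(seq_a, seq_b, k=2):
    """Crude distance: 1 - (shared words / words in the smaller word set)."""
    words_a, words_b = ktuple_set(seq_a, k), ktuple_set(seq_b, k)
    shared = len(words_a & words_b)
    return 1.0 - shared / min(len(words_a), len(words_b))


def distance_matrix(sequences, k=2):
    """Return {(name_a, name_b): distance} for every pair of sequences."""
    names = sorted(sequences)
    return {(a, b): ktuple_distance(sequences[a], sequences[b], k)
            for a, b in combinations(names, 2)}


def closest_pair(distances):
    """Return the pair with the smallest distance (aligned first)."""
    return min(distances, key=distances.get)


# Hypothetical sequences invented for this example.
example = {"s1": "TGASSGIG", "s2": "TGASSGLG", "s3": "MKKLWWAE"}
matrix = distance_matrix(example)
print(matrix, closest_pair(matrix))
```

A full implementation would repeat this merging step to build the whole guide tree and then align sequences and sub-alignments in the order the tree prescribes.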

However, the traditional progressive approach to MSA has two major problems: the local minimum problem and the choice of alignment parameters.

The local minimum problem stems from the “greedy” nature of the alignment strategy. The algorithm greedily adds sequences together, following the initial tree. There is no guarantee that the globally optimal solution, as defined by some overall measure of multiple alignment quality, will be found. More specifically, any misalignment made early in the process cannot be corrected later as new information from other sequences is added. This problem is mainly the result of an incorrect branching order in the initial tree. Initial trees are derived from a matrix of distances between separately aligned pairs of sequences and are much less reliable than trees derived from a completed multiple alignment. Misalignments made in the early alignment steps are therefore carried through to the final result, which is the local minimum problem.

The problem of choosing alignment parameters arises because, traditionally, one weight matrix and two gap penalties (one for opening a new gap and one for extending an existing gap) are used, which works well for closely related sequences. All residue weight matrices give most weight to identities, so when identities dominate an alignment, almost any weight matrix will find approximately the correct solution. This approach does not work well for distantly related or divergent sequences, however, and leads to more mismatches. In addition, for closely related sequences, the range of gap penalty values that will find the correct solution can be very broad. As more and more divergent sequences are used, the range of values that delivers the best alignment may become very narrow.

Many MSA methods have therefore been designed to address these problems, and several are currently available on the EBI website. The most important question is which MSA tool is the best candidate for a given task. There is no single best tool; the crucial points to consider when choosing a program are biological accuracy, execution time and memory usage. The most accurate programs according to benchmark tests [16,17] are MUSCLE and T-COFFEE. In practice, accuracy claims can be difficult to validate because parameters are frequently tuned to optimize performance on one or more benchmarks, and benchmark scores are typically averages over many alignments [27]. We therefore employed four independent tools available from the EBI services, ClustalW, Kalign, MUSCLE and T-Coffee, to perform MSA. However, these four tools are based on the Hidden Markov Model (HMM), which assumes that the probability of amino acid A being substituted by amino acid B is independent of the amino acid from which A itself was derived. This model has the limitation that amino acids cannot be treated as equal units in evolutionary substitution [20, 28], because some of them are encoded by six codons (Serine, Leucine and Arginine) while others are encoded by only one codon (Methionine). To overcome this limitation of the HMM, we applied one of the newest models, genetic semi-homology, as implemented in Geisha 3. This model focuses on the codons that encode each amino acid, so it takes into account cryptic mutations, which change the gene sequence without affecting the amino acid sequence. Hence, genetic semi-homology is more sensitive than HMMs when working on non-homologous sequences [28, 36].

In this study, five independent MSA tools, each of which addresses these issues in its own way, were therefore applied in order to improve the accuracy of the MSA [29].

3.1.5.1 ClustalW

ClustalW introduced several improvements to the progressive multiple alignment method that greatly improve its sensitivity without sacrificing the speed and efficiency that make this approach so practical.

In ClustalW, the problem of choosing alignment parameters is addressed by varying the gap penalties in a position- and residue-specific manner [19].

All pairs of sequences are first aligned separately by dynamic programming, using two gap penalties and a full amino acid weight matrix to score the alignments.

A guide tree is then created from the pairwise scores using Neighbor-Joining. In ClustalW, all of the remaining modifications apply to the final progressive alignment stage:

• Initially, gap penalties are calculated depending on the weight matrix, the similarity of the sequences and their lengths.

• Sensible local gap opening penalties are derived at every position in each pre-aligned group of sequences, and they vary as new sequences are added.

• The final modification allows the addition of very divergent sequences to be delayed until the end of the alignment process, when all of the more closely related sequences have been aligned.

Initial values can be set by the user. The software then automatically attempts to choose appropriate gap penalties for each sequence alignment, depending on several factors: the weight matrix, the similarity of the sequences, the lengths of the sequences, the differences in length between the sequences, and position-specific gap penalties [30].

3.1.5.2 MUSCLE (Multiple Sequence Comparison by Log-Expectation)

Stage 1: Draft progressive

All unaligned sequences are first compared by k-mer counting (K-tuple word matching) to create the k-mer distance matrix D1. From D1, tree 1 is calculated by clustering with UPGMA (Unweighted Pair Group Method with Arithmetic Averages). A progressive alignment is then built by following tree 1: at each leaf, a profile is constructed from an input sequence, and at each internal node a pairwise alignment of the two child profiles is constructed to create a new profile. Nodes in the tree are visited in prefix order (children before their parents). This produces multiple sequence alignment 1.

Stage 2: Improved progressive

The percentage identity of each pair of sequences in multiple sequence alignment 1 is computed and converted into the Kimura distance matrix D2. Tree 2 is then produced from D2 using UPGMA. The progressive alignment is repeated following tree 2 to produce multiple sequence alignment 2; this step is optimized by computing new alignments only for the sub-trees whose branching order changed relative to tree 1.

Stage 3: Refinement

An edge is deleted from tree 2 to produce two sub-trees, and the profile of the multiple alignment in each sub-tree is computed. The two sub-tree profiles are then realigned to produce a new multiple sequence alignment. Finally, the sum-of-pairs score is used to assess the new alignment: if the sum-of-pairs score improves, the new alignment is kept as the final multiple sequence alignment [20].
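The sum-of-pairs criterion used in the refinement stage can be illustrated with a small sketch. The scoring values below (match +1, mismatch -1, residue against gap -2, gap against gap ignored) are assumptions chosen for readability; MUSCLE itself scores pairs with a substitution matrix and its own gap model.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of every column of an alignment.

    `alignment` is a list of equal-length strings in which '-' marks a gap.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue            # gap against gap carries no information
            if a == "-" or b == "-":
                total += gap        # residue aligned to a gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


# A refinement step would keep the candidate only if its score is higher.
current = ["AC-T", "ACGT", "A-GT"]
candidate = ["ACT-", "ACGT", "AGT-"]
keep_candidate = sum_of_pairs(candidate) > sum_of_pairs(current)
```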

an edit distance d if A can be transformed into B by applying d mismatches (insertions/deletions); these distance scores are then used to build the tree.

With the Wu-Manber algorithm, patterns that match with mismatches can still be readily found, which enables the algorithm to report meaningful distances between highly divergent sequences. However, when matching patterns, many spurious (false but seemingly true) matches are reported.

The Wu-Manber algorithm is also applied in the progressive alignment. At each internal node of the guide tree, two profiles are aligned. Optionally, KALIGN uses Wu-Manber matches as anchor points during the alignment phase, which requires two extra steps in the dynamic programming. KALIGN employs a global dynamic programming method with affine gap penalties, which means that residues are assigned to one of three states (aligned, gap in sequence A, and gap in sequence B). It disallows a gap in one sequence being immediately followed by a gap in the other sequence. When the state matrices are filled, the final cells contain the maximum alignment score, and a trace-back procedure (which requires the matrices) is used to retrieve the actual alignment.

There are two extra steps in the dynamic programming:

• Consistency check: this step sieves through the thousands of matches found between two sequences and finds the largest set of matches that can be included in an alignment.

• Updating of pattern match positions: this step adjusts the absolute positions of matches found within sequences to their relative positions within the profiles generated by the dynamic programming step.

KALIGN uses substitution matrices (BLOSUM, PAM) and affine gap penalties in the dynamic programming. A common idea is that similar sequences should be aligned with hard matrices (PAM50, BLOSUM80), while more distantly related sequences align better with soft matrices (PAM250, BLOSUM40) [31].

3.1.5.4 T-COFFEE (Tree-based Consistency Objective Function for Alignment Evaluation)

This method has two main features. First, it provides a simple and flexible means of generating an MSA from heterogeneous sources (combining local and global pairwise alignments). Second, an optimization method is used to find the multiple alignment that best fits the pairwise alignments in the input library.

First, the ClustalW primary library is created from global pairwise alignments, and the Lalign primary library is created from local pairwise alignments. The local and global alignments are then combined by addition: if a pair is duplicated between the two libraries, it is merged into a single entry whose weight is the sum of the two weights; otherwise, a new entry is created for the pair being considered.

The second step is weighting, or signal addition. T-Coffee assigns a weight to each pair of aligned residues in the library.

The third step is the primary library, a list of weighted pairwise constraints. Each constraint receives a weight equal to the percent identity of the pairwise alignment it comes from.

The fourth step is extension. For each pair of aligned residues in the library, T-Coffee assigns a weight that reflects the degree to which those residues align consistently with residues from all the other sequences.

The fifth step is the extended library. The final weight for any pair of residues reflects some of the information contained in the whole family. It is obtained by taking each aligned residue pair from the library and checking it against the remaining sequences. Thus, the weight of a pair of residues is the sum of all the weights gathered through the examination of all the triplets involving this pair.

The final step is progressive alignment. Instead of BLOSUM/PAM, the weights in the extended library are used to align the residues of the two closest sequences. This pair of sequences is then fixed, and any gaps that have been introduced cannot be shifted later. The next closest two sequences are then aligned, or a sequence is added to the existing alignment of the first two sequences, and so on until the completed MSA is created [32].

on amino acid sequence; single point mutation is a common mechanism of protein variability. For example:

Met (AUG) is changed to Arg (AGG) and then to Lys (AAG). If Arg originated from Met, one step is needed to change Arg to Lys. However, Leu (CUR) is changed to Arg (CGR), then to Gln (CAR) and finally to Lys (AAR). In this case, if Arg originated from Leu, two steps are needed to change Arg to Lys.

Thus, ClustalW, KALIGN, T-COFFEE and MUSCLE do not take the genetic code into account. To overcome this limitation, genetic semi-homology as implemented in Geisha3 was used in our study to increase the accuracy of the alignment; it also considers the close relationship between amino acids and their codons in related proteins. For instance:

Met (AUG) is converted into Arg (AGG) and then into Lys (AAG). Each of these steps is a single point mutation at the second codon position, from U to G and then from G to A. However, if the Arg in this example is encoded by CGU, then mutating it to Lys (AAR) can no longer be achieved by a single transition or transversion. Thus, Geisha3 not only helps to increase the accuracy of the alignment but also helps to reduce the mismatches introduced during the alignment process.

In Geisha3, the term semi-homology is used: two residues are semi-homologous if their codons differ by only one substitution. There are three types of semi-homology:

The first type concerns amino acids whose codons differ in one nucleotide of the same type (a transition), such as pyrimidine (T and C) to pyrimidine or purine (A and G) to purine.

The second type concerns amino acids whose codons differ in one nucleotide of a different type (a transversion), such as pyrimidine to purine.

The last type is not an alternative to the former two. It concerns residues whose codons differ at the third codon position, which is known to be the most tolerant position in encoding amino acids [33, 34, 35].
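The codon comparisons behind these definitions are easy to check by hand or in code. The sketch below is an illustration only and is not taken from Geisha3: the function name is hypothetical, codons are written in the RNA alphabet (U instead of T) to match the examples above, and only the single-substitution test and the transition/transversion distinction are modelled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def single_substitution(codon_a, codon_b):
    """Classify two codons that differ at exactly one position.

    Returns (position, kind) with position counted from 1 and kind either
    'transition' (same nucleotide type) or 'transversion' (different type).
    Returns None if the codons differ at zero or at more than one position.
    """
    diffs = [i for i in range(3) if codon_a[i] != codon_b[i]]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    pair = {codon_a[i], codon_b[i]}
    same_type = pair <= PURINES or pair <= PYRIMIDINES
    return (i + 1, "transition" if same_type else "transversion")


print(single_substitution("AUG", "AGG"))  # Met -> Arg
print(single_substitution("AGG", "AAG"))  # Arg -> Lys
print(single_substitution("CGU", "AAG"))  # Arg (CGU) -> Lys
```

A position of 3 in the result corresponds to the third type described above, which can co-occur with either of the first two.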

3.1.6 Multiple sequence alignments of human SDR protein and alignment verification

Seventy-five sequences of human SDR enzymes were collected from the UniProtKB database (http://www.uniprot.org). The sequences were initially aligned with ClustalW, T-Coffee, MUSCLE and Kalign using the template sequence Q14376 (UDP-glucose-4-epimerase). In order to create the most robust alignment possible, the initial alignments from each method were compared against one another, and the most divergent sequences, with a very low degree of shared identity, were removed before subsequent analyses.

The potential evolutionary relationships between corresponding non-identical positions in the four different multiple alignments were verified separately using the genetic semi-homology algorithm implemented in version 3 of the program Geisha [33,34,35]. Geisha3 is freely accessible from the website (http://atama.wnb.uz.zgora.pl/~jleluk/linki.html). Verifying multiple sequence alignments with Geisha helps to identify and reduce potential mismatches that may occur during the initial alignment process. ClustalW, T-Coffee, MUSCLE and Kalign are based on the Hidden Markov Model. Geisha improves alignment accuracy by completing the alignment while considering point mutations. Unlike the programs used for the initial alignments, Geisha assumes that the probability of one amino acid being replaced by another depends significantly on which amino acids occupied that position in the past.

Only the sequences with a sufficiently high level of identity (equal to or higher than 80% in this case) were kept in the MSA result; the others were removed, because the retained sequences were the input for constructing the consensus sequence, which shows the most conserved motifs of the human SDR family.

3.2 Consensus sequence construction and BLAST search

3.2.1 What is BLAST (Basic Local Alignment Search Tool)?

The most widely used software for efficiently comparing biological sequences to a database is BLAST [26]. The BLAST computation is organized as a three-stage pipeline:

Stage 1: Word matching. This stage detects substrings of fixed length w in the database stream that perfectly match a substring of the query.

Stage 2: Ungapped extension. Each matching w-mer is forwarded to the second stage, which extends the w-mer to either side to identify a longer pair of sequences around it that match with at most a small number of mismatched characters. These longer matches are high-scoring segment pairs (HSPs).

Stage 3: Gapped extension. Every HSP that has both enough matches and sufficiently few mismatches is passed to the gapped extension stage, which uses the Smith-Waterman dynamic programming algorithm to extend it into a gapped alignment, a pair of similar regions that may differ by arbitrary edits [37, 38].
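The first two stages can be sketched as follows. This is a toy illustration of the seed-and-extend idea, not NCBI BLAST code: the function names, the word length w = 3, the +1/-1 scoring and the drop-off value are all assumptions, and the gapped Smith-Waterman stage is omitted.

```python
def find_seeds(query, subject, w=3):
    """Stage 1: return (i, j) pairs where query[i:i+w] == subject[j:j+w]."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, qi, sj, step, match, mismatch, drop):
    """Walk in one direction; return (best score gain, length at that best)."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break               # score fell too far below the best seen
        qi += step
        sj += step
    return best, best_len


def extend_ungapped(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Stage 2: return (score, query_start, subject_start, length) of the HSP."""
    right, right_len = _extend(query, subject, i + w, j + w, 1, match, mismatch, drop)
    left, left_len = _extend(query, subject, i - 1, j - 1, -1, match, mismatch, drop)
    return (w * match + left + right, i - left_len, j - left_len, w + left_len + right_len)


# Hypothetical toy sequences sharing the segment GTYKL.
query, subject = "AAGTYKLMN", "CCGTYKLQQ"
seeds = find_seeds(query, subject)
hsp = extend_ungapped(query, subject, *seeds[0])
```

In this toy case the first seed is extended to the shared segment query[2:7], and only HSPs scoring above a chosen threshold would be handed to the gapped stage.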

In this study, we applied NCBI BLAST [39] to search for sequences homologous to the human SDR family.

3.2.2 Construction of consensus of Human SDR protein family and BLAST search

As a way of summarizing the verified human SDR multiple sequence alignments, a single consensus sequence for the entire human SDR super-family was established. The consensus sequence was obtained using the Consensus Sequence Constructor [33,34,35] with default parameter values. Highly conserved positions (>70% identity) are marked with bold black letters, positions of intermediate conservation (>30% identity) are indicated with black characters corresponding to the most commonly occurring residue, and positions marked X are variable positions at which no single residue occurs in more than 30% of sequences.
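The thresholding rule described above can be written down compactly. The sketch below is not the Consensus Sequence Constructor itself; it assumes that frequencies are taken over all sequences, that gaps are ignored when counting residues, and that plain text stands in for typography (upper case for the bold, highly conserved letters and lower case for intermediate conservation).

```python
from collections import Counter


def consensus(alignment, high=0.70, low=0.30):
    """Build a consensus string from equal-length aligned sequences."""
    n = len(alignment)
    symbols = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        if not counts:
            symbols.append("X")          # column contains only gaps
            continue
        residue, count = counts.most_common(1)[0]
        freq = count / n
        if freq > high:
            symbols.append(residue.upper())   # highly conserved
        elif freq > low:
            symbols.append(residue.lower())   # intermediate conservation
        else:
            symbols.append("X")               # variable position
    return "".join(symbols)


# Hypothetical four-sequence alignment.
toy_alignment = ["GASK", "GATR", "GCSE", "GDVQ"]
print(consensus(toy_alignment))
```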

This is an original application designed by our Polish collaborators and is freely available for non-commercial academic purposes from the website (http://atama.wnb.uz.zgora.pl/~jleluk/linki.html). The most robust consensus sequence was then used to identify two types of specificity for all members of the human SDR super-family: 1) the general specificity, which indicates common features of the entire enzyme super-family, and 2) the individual specificity, which distinguishes the unique structural properties of each group within the human SDR super-family. Put another way, the general specificity is concerned with the more conserved regions of the human SDR protein sequence, while the individual specificity highlights the more variable regions. By investigating both types of specificity, our results may be of better use for future work on developing inhibitors that can be directed at only one or a few enzymes without affecting the activity of others. Lastly, the consensus sequence was also used in a BLAST search for potential new members of the human SDR family. The new sequences (about 100 additional sequences) supplemented the original 75 SDR family members and were aligned in the same way as described above.

3.3 Phylogenetic tree construction and comparison of consensus sequences

3.3.1 Phylogenetic Tree Prediction

A phylogenetic tree shows the inferred evolutionary relationships among various biological species or other entities, based on the similarities and differences in their physical and/or genetic characteristics. The taxa joined together in the tree are implied to have descended from a common ancestor [25].

Phylogenetic tree prediction, which is used for structuring sequences, constitutes an important area of sequence analysis. It can be helpful when analyzing changes that have occurred in the evolution of different organisms, or when studying the evolution of a family of sequences. Based on these analyses, the most closely related sequences can be identified because they occupy neighboring branches on a tree.

When a phylogenetic analysis of a family of related nucleic acid or protein sequences is performed, the evolutionary history of the family is examined and the sequences are shown in the form of an evolutionary tree. The original ancestor sequence then forms the root node of the tree. The branching relations in the tree show the degree to which the sequences are related. The most closely related sequences are placed as neighboring leaves and are joined to a common branch beneath them. Phylogenetic analysis is closely related to multiple alignment, which is often the basis from which the phylogenetic analysis proceeds. One reason for building a phylogenetic tree from the multiple alignment is that the tree makes the relationships between the sequences clearer. Another reason is that amino acids have been substituted as the genes for the proteins developed in the different organisms during evolution, and a phylogenetic tree can be of use when these substitutions are to be analyzed [21]. An illustration of a small phylogenetic tree with a few substitutions is given in Figure 3.

Figure 3: Here A, B and C represent three highly conserved sequences of the same protein taken from three separate organisms. The phylogenetic tree gives a view of the substitutions that happened during evolution as these sequences evolved from the same ancestor [21].

How, then, is the phylogenetic analysis performed? First a multiple alignment is built, using one of the methods described in the section on multiple alignment, for example the CLUSTALW program. Then the substitution model is chosen. The choice of model is based on how similar the sequences are. If they are highly similar, the PAM matrix is often useful, since it is designed to track the evolutionary origins of proteins, but if they are less similar, the BLOSUM matrix might be superior, because it is designed to find the conserved domains of proteins [21]. After choosing the substitution model, the next step is to build the tree. There are several tree-building methods to choose from. These methods can be divided into two main groups, namely distance-based and character-based methods [25].

These two groups are only briefly explained below, because phylogenetic prediction was not considered as a solution to the classification problem addressed in this project. The reason for this is that the SDR proteins are distantly related. A phylogenetic analysis of very different sequences is difficult to carry out, as there are several possible evolutionary paths that could have given rise to the observed sequences. This results in a very complex problem that requires considerable expertise to solve [21].

3.3.2 Distance-based Method

Distance-based methods use the number of changes, the distance, between two aligned sequences to derive trees [22]. The sequence pairs with the fewest changes between them are the most closely related. They are placed as neighbors in the tree and are both connected to their common ancestor node by a branch [21]. Several different methods are classed as distance-based, for example the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), Neighbor-Joining, and Fitch-Margoliash [21, 28].

3.3.3 Character-based Method

The character-based methods derive trees that optimize the distribution of the actual data patterns for each character. Pairwise distances are therefore not fixed, as they are in distance-based methods, because they are determined by the tree topology. This allows the reliability of each base position in an alignment to be assessed on the basis of all other base positions [25]. Examples of character-based methods are Maximum Parsimony and Maximum Likelihood. The last method for sequence analysis is secondary structure prediction, which is described in the paragraph below.

Although both distance-based and character-based methods can be used to construct a phylogenetic tree, in our study we preferred character-based methods, specifically Maximum Likelihood (ML) and Maximum Parsimony (MP). This is because distance-based methods such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and Neighbor-Joining (NJ) have several drawbacks: they work well on closely related sequences but fail on distantly related, divergent sequences [28]. In our study, the level of identity within the SDR family is only 15% to 30%, so MP and ML were used to overcome the drawbacks of UPGMA and NJ and produce the most accurate result [28].

For each site, each leaf is labeled with a set containing the nucleotide observed at this position. For each internal node i with children j and k, labeled Sj and Sk
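The labeling described here is the starting point of Fitch's small-parsimony procedure. As a minimal sketch, assuming the standard Fitch rule (the set at an internal node is the intersection of its children's sets when that intersection is non-empty, and otherwise their union at a cost of one change), the bottom-up pass for a single site looks like this. The nested-tuple tree encoding and the function name are assumptions made for the example.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass for one site.

    `tree` is a leaf name (str) or a pair (left_subtree, right_subtree).
    `leaf_states` maps each leaf name to the nucleotide observed at the site.
    Returns (state_set, minimum_number_of_changes) for the subtree.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    s_j, cost_j = fitch(left, leaf_states)
    s_k, cost_k = fitch(right, leaf_states)
    common = s_j & s_k
    if common:
        return common, cost_j + cost_k        # children agree: no extra change
    return s_j | s_k, cost_j + cost_k + 1     # children disagree: one change


# Hypothetical four-taxon tree and one alignment column.
tree = (("A", "B"), ("C", "D"))
column = {"A": "G", "B": "G", "C": "T", "D": "G"}
root_set, changes = fitch(tree, column)
```

Summing the change counts over all sites gives the parsimony score of a tree; maximum parsimony then prefers the tree with the lowest total.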

For maximum likelihood, given data D and a model M, the aim is to find the tree T for which

Pr(D | M, T) is maximized.

Two independence assumptions are made:

• Different sites evolve independently

• Divergent sequences evolve independently after diverging
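Under the first assumption, the likelihood of the whole alignment factorizes over sites. Written out, with D_s denoting column s of an alignment with n columns (this notation is introduced here for illustration and does not appear in the original text):

```latex
\Pr(D \mid M, T) \;=\; \prod_{s=1}^{n} \Pr(D_s \mid M, T),
\qquad
\hat{T} \;=\; \arg\max_{T} \, \Pr(D \mid M, T)
```

The second assumption is what allows each per-site term to be computed by treating the subtrees below a node independently of one another.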

3.3.4 Human SDR phylogenetic tree and comparison of consensus sequences

The results of our multiple sequence alignments were used as input data for constructing phylogenetic trees that outline the interrelationships of the various members of the human SDR super-family. In this study, two independent approaches were used to construct the phylogenetic trees: PHYLIP (Felsenstein, 1989) and SSSSg (database: Uniprot, matrix: BLOSUM45, number of matches: 10 and E upper value: 5.0). PHYLIP is a free package of programs for inferring phylogenies, accessible at (http://www.phylip.com). SSSSg is our original software (http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/ssssg/ssssg.zip). PHYLIP uses Fitch’s maximum parsimony algorithm and constructs the phylogenetic tree that requires the least amount of evolutionary change to fit the input data. To supplement our parsimony analyses, we also applied the maximum likelihood algorithm to our data using the program SSSSg. Maximum likelihood, like maximum parsimony, is an optimality criterion for the reconstruction of phylogenies. Maximum likelihood methods differ from the non-parametric parsimony approach because they use an explicit model of character evolution for tree construction. Both the maximum parsimony and maximum likelihood methods recovered the same five high-level branching events within the human SDR family, and lower-level topological differences were negligible. As such, we arbitrarily chose to use the maximum likelihood tree for all subsequent analyses.

Using the Consensus Sequence Constructor, we identified a single consensus protein sequence for each of the five human SDR subgroups. The five resulting consensus sequences were compared in order to identify the conserved and variable sequence regions in human SDR enzymes.

To further elucidate patterns of conservation and variation in human SDR enzymes, a comparative analysis of the 3D protein structures corresponding to the five consensus sequences was also conducted. Using the Protein Data Bank, we identified a representative structure for each of the five groups recovered in the phylogenetic reconstructions. The selection criteria were the maximum sequence identity with all members of each group in the alignment and the highest degree of similarity at the tertiary structural level.

3.4 Mutational variability of human SDRs

Mutational Variability (Talana, Consurf)

Mutational variability analysis was carried out to highlight the conserved and variable regions in the sequences and structures of human SDRs.

In our study, we applied two independent software tools, Talana and ConSurf. The ConSurf server is used for estimating the evolutionary conservation of amino acids based on the phylogenetic relations between homologous sequences.

3.4.1 Consurf

The first step in ConSurf is to find homologous sequences (using BLAST) based on the protein structure and amino acid sequence. The sequences are clustered, and highly similar sequences are removed using CD-HIT with a cut-off of 95%. A multiple sequence alignment is then created, and a phylogenetic tree is built from the multiple sequence alignment using Neighbor-Joining.

In the second step, position-specific conservation scores are calculated using maximum likelihood (depending on the user's choice).

In the third step, the conservation scores are divided into a discrete scale of nine grades for visualization: grade 1, the most variable positions, is colored turquoise; grade 5, the intermediately conserved positions, is colored white; and grade 9, the most conserved positions, is colored maroon.

The conservation score at a site corresponds to the site’s evolutionary rate. It is a measure of evolutionary conservation at each sequence site of the target chain. The color grades are assigned as follows: the conservation scores below the average (negative values, indicative of slowly evolving, conserved sites) are divided into 4.5 equal intervals. The same 4.5 intervals are used for the scores above the average (positive values, rapidly evolving, variable sites).

3.4.2 Talana

Similar to ConSurf, Talana calculates the number of different amino acids that occupy each position in a provided MSA. Charts and scripts are used to visualize the variability on a PDB structure. In addition, Talana divides the conservation scores into 12 grades. Grades 1 and 2 are the most conserved and are shown in the darkest blue, grades 3 to 6 are intermediately conserved and are shown in light blue and white, whereas grades 7 to 9 are the most variable and are shown in pink and red.

3.4.3 Mutational variability of human SDR protein family

We used the five representative structures we identified (Table 1), together with all protein sequences available in each group identified in our phylogenetic analyses, to study the mutational variability within the five subgroups of the human SDR family. ConSurf (available at consurf.tau.ac.il) and Talana (available at http://www.bioware.republika.pl/) were used to identify conserved and variable residues within functional regions in the aligned homologous sequences.

Both programs estimate the evolutionary conservation of amino acids based on the phylogenetic relationships between homologous sequences and produce conservation scores that correspond to the rate of evolution at each site. In ConSurf, the scores are divided into nine grades for the visualization of differing rates of evolution: grade 1 is the most variable position and is colored turquoise; grade 5 is the intermediately conserved position and is colored white; and grade 9 is the most conserved position and is colored maroon. In Talana, the conservation scores are instead divided into 12 grades: grade 1 is the most conserved position (darkest blue); grade 6 is the intermediately conserved position (white); and grade 12 is the most variable position (darkest red). After the conservation score has been calculated for each site, both programs automatically project the value for each sequence onto the consensus protein structures. Results from both

(http://www.geneinfinity.org/rastop) and mutually compared for verification of their compatibility

Table 1: PDB codes and names of the five representative structures

Group   PDB code and name of representative structure
1       3edm chain A, Uncharacterized oxidoreductase SSP0419
2       1hdc chain A, Retinol dehydrogenase 7
3       1yb1 chain A, 17-beta hydroxysteroid dehydrogenase 13
4       3rd5 chain A, Retinol dehydrogenase 11
5       1q7b chain A, 3-Oxoacyl-[acyl-carrier-protein] reductase FabG

3.5 Analysis of correlated mutations

Correlated mutations are the phenomenon of several mutations occurring simultaneously and depending on each other. According to the current hypothesis of positive Darwinian selection at the molecular level, correlated mutations are related to changes occurring in their neighborhood. They reflect protein-protein interactions, and they preserve the biological activity and structural properties of the molecules [40].

In this project, we also studied mutational correlation among human SDR members in order to gain a better understanding of protein-protein interactions within this protein family. This information may be useful for further studies on designing inhibitors. Corm and Talana are the two software tools used to accomplish this task.

Lastly, we set out to investigate the tendency of different amino acids along human SDR proteins to mutate together. It is clear that many residues within the same protein have evolved to form specific molecular complexes and that the specificity of these interactions is essential for their function. To maintain functionality, it is reasonable to assume that the sequence changes accumulated during the evolution of one of the interacting residues must be compensated by changes in the other [34,35,36]. In this way, the network of necessary inter-residue contacts may constrain the divergence of the protein sequence to some extent.

Correlated mutations in the representative protein structures and corresponding consensus sequences of each subgroup of human SDRs were identified, localized and analysed with the aid of Talana and Corm (freely available for non-commercial purposes at http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/corm.jar). The program FEEDBACK, which is designed to analyze aligned protein sequences for the occurrence of correlated mutations, is implemented in Corm. For each residue occurring at each position, it returns all residues occurring at all other sequence positions of the aligned proteins. Talana produces a similar set of results, but also highlights correlated sequence mutations in the corresponding protein structures. The candidate correlated sequence and structure mutations recovered using both software packages were compared and then visualized on the SDR template structures of the five groups using DSVisualizer1.7 from Accelrys (http://accelrys.com/products/discoverystudio/visualization-download.php) and/or Rastop2.2 (http://www.geneinfinity.org/rastop). Visualizing the correlated mutation results from Talana and Corm provided an additional method of investigating potential correlated mutations in protein structure.
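The kind of query described for Corm can be illustrated with a small sketch. This is not the Corm or FEEDBACK implementation; the function names are hypothetical, and the "coupled columns" test (a strict one-to-one pairing of residues between two variable columns) is a deliberately simplified stand-in for a correlated-mutation criterion.

```python
from collections import Counter


def residues_given(alignment, position, residue):
    """For sequences with `residue` at `position`, count residues at every other column."""
    selected = [seq for seq in alignment if seq[position] == residue]
    return {
        col: Counter(seq[col] for seq in selected)
        for col in range(len(alignment[0]))
        if col != position
    }


def coupled_columns(alignment):
    """Return pairs of variable columns whose residues pair up one-to-one."""
    ncol = len(alignment[0])
    pairs = []
    for i in range(ncol):
        col_i = [seq[i] for seq in alignment]
        if len(set(col_i)) < 2:
            continue                      # invariant column, nothing to correlate
        for j in range(i + 1, ncol):
            col_j = [seq[j] for seq in alignment]
            if len(set(col_j)) < 2:
                continue
            observed = set(zip(col_i, col_j))
            # One-to-one pairing: as many distinct pairs as distinct residues.
            if len(observed) == len(set(col_i)) == len(set(col_j)):
                pairs.append((i, j))
    return pairs


# Hypothetical alignment: columns 1 and 3 change together (K with E, R with Q).
toy_msa = ["AKLE", "AKME", "ARLQ", "ARMQ"]
print(coupled_columns(toy_msa))
```

Column indices here are zero-based; candidate pairs found this way would still need to be inspected on the structure, as described above, since co-variation alone does not establish physical contact.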

3.6 Availability of original software generated by authors

The original applications Geisha 3, Consensus Constructor, SSSSg, Talana and Corm are freely available at the addresses listed above. They are also available directly upon request to the authors. Additionally, the authors are willing to assist in the appropriate and effective running of all applications.

4 RESULTS AND DISCUSSION

4.1 Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity

After multiple sequence alignment and verification, we identified four sequences (P49327, P14060, P56159, and P56937) that shared very low sequence identity with the rest of the members of the human SDR family; these were removed from subsequent analyses. We constructed the consensus sequence from the remaining 71 sequences and used it to identify features of general and individual specificity.

Our comparative analyses revealed little general specificity and much individual specificity amongst human SDR sequences (Figure 4). Among the 306 positions in the consensus sequence, only 5 positions (bold letters; 1.6%) are occupied by the same residue in more than 70% of sequences, whereas 105 positions (34.3%) are occupied by the same residue in at least 30% of sequences. The remaining 196 positions (X letters; 64.1%) are not occupied by any single residue in more than 30% of sequences.
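The reported percentages follow directly from the position counts, which can be checked with a few lines. The counts are those given in the text; the variable names and category labels are chosen here for illustration.

```python
total_positions = 306
counts = {
    "highly conserved (>70%)": 5,
    "intermediate (>30%)": 105,
    "variable (X)": 196,
}

# The three categories should account for every consensus position.
assert sum(counts.values()) == total_positions

percentages = {
    label: round(100 * n / total_positions, 1) for label, n in counts.items()
}
for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```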

Figure 4: Part of the multiple sequence alignment of the 71 human SDR family members
