
Comparative analysis of microbial genomes architecture and applications



KISHORE RAMAJI SAKHARKAR

(M.Tech, Computer Science, BITS, Pilani, India)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF MICROBIOLOGY
YONG LOO LIN SCHOOL OF MEDICINE
NATIONAL UNIVERSITY OF SINGAPORE

2006


DEDICATED TO MY PARENTS

“SHEVANTI AND RAMAJI”


ACKNOWLEDGEMENTS

I am very grateful to Professor Vincent Tak Kwong Chow, who introduced me to the fascinating field of microbial genomics. I take this as a special opportunity to thank him profusely for his patience, continued support, guidance and encouragement. My heartfelt thanks go to Professor Pervaiz Shazib, Vice-Dean, Research, Yong Loo Lin School of Medicine, National University of Singapore, for extending all possible help and support throughout my research work.

I am indebted to Professor Micheal and Professor Olson of Genome Research Centre, Washington, USA, for providing access to essential genes in Pseudomonas aeruginosa. Special thanks are due to Prof. Stanley Falkow, Stanford University, USA, and Prof. Salama, Fred Hutchinson Cancer Centre, USA, for making available the H. pylori mutagenesis data.

During my doctoral research program, I was certainly blessed to interact with many eminent scientists in the emerging field of bioinformatics. The optimism and critical comments showered on the project were invaluable.

My thanks are also due to Ms. Geetha Sreedhara Warrior, Ms. Siti Maryam Binte Masnor, Ms. Stacy Tan and Ms. Geetha Baskaran for administrative support.

Last but not least, my thanks are due to my wife, Dr. Meena Kishore Sakharkar, my daughter Neha and my son Anurag for their endless support and love.

4.3.4 Metabolic pathway comparison of Glycolysis, TCA and Pentose Phosphate Pathway

4.5 Caveats

CHAPTER 7: IDENTIFICATION OF DRUG TARGETS

SUMMARY

The availability of complete genome sequences of many bacterial species is, for the first time, facilitating many computational approaches for understanding bacterial genomes. One of the major incentives behind the genome sequencing of numerous pathogenic bacteria is the desire to better understand their peculiarities and to develop new approaches for controlling the human diseases caused by these organisms. This task has become even more urgent with the rapid evolution of antibiotic resistance in many bacterial pathogens. Novel drug targets are required in order to design new defenses against antibiotic-resistant pathogens. The availability of the complete genome sequences of many pathogenic microbes provides information on every potential drug target and is an invaluable resource in the search for novel compounds. This thesis is an attempt to computationally analyze the host-specific adaptations in obligatory intracellular bacteria and to develop tools that facilitate genome analysis, towards a better understanding of bacterial genome evolution and accelerated computational identification of microbial drug targets.

First, we developed novel tools, GOV and PPD, for understanding microbial genome design, architecture and evolution. We then used the data derived from these tools to understand host-specific adaptations in reduced genomes (obligatory intracellular parasites) and their intimate one-sided associations with eukaryotic cells. We demonstrate that gene loss in these bacteria is differential, function dependent and independent of protein length. There is substantial sharing of a "backbone genome" among all the obligatory intracellular parasites. Further filtering of these shared genes as candidate targets may help us identify a target that is "selective and specific" for all the obligate genomes. A substantial proportion of genes in the "backbone genome" have overlapping gene architecture and are involved in important cellular functions. Certain overlapping genes are also found to be involved in gene fusion events. Genes involved in fusion are identified as essential genes, and these could again be putative drug targets. Our analysis shows that fusion genes have incremental structural and functional architectures, and that intergenic DNA plays a significant role in these enhanced attributes and has contributed to genome evolution.

We have developed an in silico approach for the identification of putative drug targets in microbial genomes and have confirmed our findings by comparison with experimental data. These processes are efficient ways of exploring genomes at the niche and life-style level, enriching potential target genes, and identifying those that are critical for normal cell function. The comprehensive essential gene lists generated will allow an accelerated genetic dissection of traits such as metabolic flexibility and inherent drug resistance. Such a strategy will enable us to locate critical pathways and steps in pathogenesis, to target these steps by designing new drugs, and to inhibit the infectious agent of interest with new antimicrobial agents. These results underscore the utility of large genomic databases for systematic in silico drug target identification in the post-genomic era.


LIST OF TABLES

Table 4.3 Pathway alignment table for Glycolysis and Gluconeogenesis

Table 4.4 Pathway alignment table for Pentose Phosphate Pathway

Table 4.6 List of genes in the 'backbone genome' (1 implies essential by match in the essential gene set of M. genitalium)

Table 5.2 Number of genes in different directions of overlap in the genomes under study

Table 5.3 List of genes present as overlapping genes in Rickettsia prowazekii and Rickettsia conorii with the same number of nucleotides in overlap

Table 5.4 List of genes present as overlapping genes in Rickettsia prowazekii and Rickettsia conorii with a different number of nucleotides in overlap

Table 5.5 List of genes present as overlapping genes in Rickettsia prowazekii and at zero intergenic distance in Rickettsia conorii

Table 5.6 List of genes present as overlapping genes in Rickettsia prowazekii and at an intergenic distance of at least 1 bp in Rickettsia conorii

Table 5.7 List of genes present as overlapping genes in Rickettsia conorii and at an intergenic distance of at least 1 bp in Rickettsia prowazekii

Table 6.3 List of genes present as fusion genes in H. pylori strain J99 and as split juxtaposed genes in H. pylori 26695

Table 6.4 List of genes present as fusion genes in H. pylori strain 26695 and as split juxtaposed genes in H. pylori J99

LIST OF FIGURES

Figure 1.1 Number of prokaryotic, archaeal and eukaryotic genomes sequenced since 1995

Figure 2.2 Gene order for translation genes in the five Mycoplasma

Figure 4.2 Distribution of COG categories and their percentage representation in the bacterial genomes

Figure 4.3 Protein length distribution profiles of the prokaryotic genomes under study

Figure 4.4 Percentage representation of proteins vs protein size in each COG category

Figure 4.5 Reaction scheme representing a part of glucose metabolism by Glycolysis, PPP and TCA cycle

Figure 5.2 The correlations between genome size and overlapping genes

Figure 6.1 Intergenic DNA and gene fusion. Alignments exhibiting the mechanism of fusion for three non-overlapping genes separated by intergenic DNA

Figure 6.2 Overlapping genes and gene fusion. Alignments displaying the mechanism of fusion for three overlapping genes

Figure 7.1 Methodology for the identification of putative drug targets

Figure 7.2 The percentage distribution of 306 essential genes into different classes of proteins

Figure 7.3 COG category distribution for 297 essential Salmonella genes


ABBREVIATIONS

Cbu Coxiella burnetii

Ega Ehrlichia ruminantium str Gardel

Eru Ehrlichia ruminantium

Eca Ehrlichia ruminantium str.(W) (Af)

Rco Rickettsia conorii

Rfe Rickettsia felis

Rpr Rickettsia prowazekii


Rty Rickettsia typhi


CHAPTER 1

INTRODUCTION


INTRODUCTION

The smallest functional unit of inherited information is a gene (Morgan, 1917). A gene is a DNA sequence, and most genes contain information for making specific proteins. Each DNA molecule is composed of two polynucleotide strands twisted around each other to form a double helix (Watson et al., 1953). Each polynucleotide is built from four types of nucleic acid bases (adenine, thymine, guanine and cytosine, represented as A, T, G and C respectively). Each strand has a chemical polarity, described as running from the 5' end to the 3' end, based on the positions of the carbon atoms on the pentose ring to which the phosphate groups bind in either direction. The genetic code is read as a series of codons, each of which consists of three base pairs (bp) and corresponds to a single amino acid (Crick, 1968).
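As a minimal illustration of reading a sequence as codons, the Python sketch below splits a coding-strand sequence into consecutive triplets and looks each one up in a codon table. The table is deliberately partial (five of the 64 codons), and the function names and the example sequence are illustrative assumptions, not taken from the thesis.

```python
# Partial codon table for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "ATG": "Met",   # also the usual start codon
    "GCT": "Ala",
    "AAA": "Lys",
    "TGG": "Trp",
    "TAA": "Stop",
}


def split_codons(dna):
    """Split a coding-strand sequence (5' to 3') into consecutive triplets.

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna):
    """Translate codon by codon until a stop codon or the end of the sequence."""
    peptide = []
    for codon in split_codons(dna):
        residue = CODON_TABLE.get(codon, "???")  # "???" marks codons missing from the partial table
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


example = "ATGGCTAAATGGTAA"  # hypothetical sequence
codons = split_codons(example)
peptide = translate(example)
```

For the hypothetical sequence above, the five codons map to four residues followed by a stop codon.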

The word 'genome' was introduced into the scientific vocabulary by Winkler in 1920 as a blend of GENe and chromosOME, and it stood for the complete haploid set of chromosomes and genes (Winkler, 1920). Life as we know it is specified by genomes. Every organism possesses a genome that contains the biological information needed to construct and maintain a living example of that organism. Most genomes, including those of all cellular life forms, are made up of DNA, but a few viruses have RNA genomes.


1.2 GENOMES OF PROKARYOTES

Biologists divide the living world into two types of organisms:

• Eukaryotes, whose cells contain membrane-bound compartments, including a nucleus and organelles such as mitochondria and, in the case of plant cells, chloroplasts. Eukaryotes include animals, plants, fungi and protozoa.

• Prokaryotes, whose cells lack extensive internal compartments, are very different from eukaryotes. Not only are their genomes much smaller, but their physical organization is also different. In prokaryotes, most if not all of the genome is contained in a single circular, rather than linear, DNA molecule. In addition to this single 'chromosome', prokaryotes may also carry additional genes on independent, smaller, circular or linear DNA molecules called plasmids. Prokaryotes generally have fewer genes than eukaryotes, and most of their DNA is coding; that is, there is very little intergenic DNA.

Microbes make up about 60% of the earth's biomass, yet less than 1% of microbial species have been identified. Microbes have been found surviving and thriving in an amazing diversity of habitats, in extremes of heat, cold, radiation, pressure, salinity, and acidity, often where no other life forms could exist. Microbes play a critical role in natural biogeochemical cycles. Because most do not cause disease in humans, animals, or plants and are difficult to culture, they have received little attention. Identifying and harnessing their unique capabilities, which have evolved over 3.8 billion years, will offer us new solutions to longstanding challenges in environmental and waste cleanup, energy production and use, medicine, industrial processes, agriculture, and other areas.

1) Industry

The bacterium Shewanella putrefaciens, which can grow with or without oxygen, is an excellent model system for manipulating organisms for remediation. Whole-genome sequencing can elucidate metabolic pathways, including those involved in corrosion, consumption of toxic organic pollutants, and removal of toxic metals and radiation waste by conversion to insoluble forms. Scientists are also starting to appreciate the role played by microbes in global climate processes, and we can expect insights about both the biological underpinnings of climate change and the contributions of microbes to earth's biosphere. Their capabilities will soon be added to the list of traditional commercial uses for microbes in the brewing, baking, dairy, and other industries.

The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential for practical applications in industry and in government-funded environmental remediation. Because they thrive in water temperatures above the boiling point, these organisms may provide the U.S. Department of Energy (DOE), the Department of Defense, and private companies with heat-stable enzymes for use in industrial processes. These processes could include conversion of wastes to useful chemicals. A. fulgidus has the added capability of surviving at the high pressures associated with deep oil wells, and T. maritima metabolizes simple and complex carbohydrates, including glucose, sucrose, starch, xylan, and cellulose. Cellulose and xylan are the most abundant biopolymers on earth and, through their conversion to fuels such as ethanol, have major potential as sources of renewable energy. Comparisons of the genomic sequences of these two microbes can contribute to a greater understanding of evolutionary relationships as well as high-temperature protein function.

The archaeon Pyrobaculum aerophilum, first isolated from a boiling marine vent, thrives at temperatures close to the maximum tolerated by living systems (113 °C). Unlike most hyperthermophiles, P. aerophilum is able to withstand exposure to oxygen and can thus be manipulated more easily in the laboratory. Also, the proteins encoded by hyperthermophilic genomes are more stable than those of organisms living in more temperate environments.

2) Minimal genome

The fully characterized genome of M. genitalium, thought to be the smallest of any known free-living bacterium, provides a model for a minimal set of genes necessary for life. Its genome contains only 580,000 base pairs of DNA and yet encodes 470 genes. Future studies on this and other minimal genomes will help increase our understanding of more complex genomes.

3) Microbial diversity and evolution of life

Among the oldest life forms known, the archaea make up one of three phylogenetic or evolutionary domains into which all life is classified. The other two are the eukarya and the bacteria. Archaea found thriving in extreme environments of heat and cold, acidity, pressure, and salinity are known as extremophiles ("extreme-loving" organisms). Understanding the biological mechanisms underlying their hardiness may help researchers develop new industrial, biomedical, and environmental applications. Microbes may, for example, contain enzymes that are effective in driving chemical reactions in extreme environments. Some may provide enzymes useful in research; one such "extremozyme", derived from a bacterium living in hot springs in Yellowstone National Park, has become critical to current protocols for sequencing any genome, including that of humans. Other microbes have metabolic processes with potential for breaking down toxic waste or even producing methane, an energy source. Comparisons of the genomes of organisms from all three domains are helping scientists better understand the evolution of all living things.

4) Genome sequencing

Methanococcus jannaschii was among the first archaea chosen for sequencing. In 1996 its completed sequencing and analysis confirmed that the "tree of life" has three domains, a hypothesis first advanced nearly 20 years before by Carl Woese (University of Illinois) but not given much credence at the time. The single-celled M. jannaschii was isolated from a sample collected beneath more than 8000 feet of water at the base of a deep-sea thermal vent on the floor of the Pacific Ocean. The microbe lives without the sunlight, oxygen, and organic carbon important to most other forms of life, and uses carbon dioxide, nitrogen, and hydrogen expelled from the thermal vent for its life functions. When the entire DNA sequence of M. jannaschii was determined, scientists found that about 65% of its potential gene sequences were not related to any gene previously discovered, representing an exciting area for future investigation. These collections offer a rich resource for identifying and isolating novel species with potentially unique sets of genes as well as proteins with environmental, energy, biotechnological, and other applications.


1.4 MICROBIAL GENOME PROGRAM

DNA sequencing made it possible to examine the complete genetic information contained in a haploid set of chromosomes and to compare it with that of other organisms; together these form the essence of 'genomics'. Genomics is a field that analyses and compares the complete genome sequences of organisms. Sequence provides the most fundamental information about an organism. The genes and the regulatory sites encoded in the sequence specify the 'parts list' and 'operating instructions' for the organism and yield clues to its evolution. To explore the possibilities for new applications, in 1994 the U.S. Department of Energy (DOE) established the Microbial Genome Program (MGP) as a companion to its Human Genome Program (HGP). A principal goal of this project is to determine the complete DNA sequence (the genome) of a number of nonpathogenic microbes, that is, microbes that do not cause disease, that may be useful to DOE in carrying out its missions. The microbes chosen for genomic sequencing were selected with broad input from the scientific community. "The microbial diversity of the program is an absolute treasure trove for [research in] biotechnology, ecology, evolution, and bioremediation," notes David Schlessinger (National Institute on Aging).

Only a few years ago, scientists could not have imagined having full access to the genetic structure of more than a few such organisms. Today, a number of complete microbial genomes (both pathogenic and non-pathogenic), many supported by DOE's MGP, have been sequenced, and the rate of reported new genome sequences is increasing rapidly. The number of possible applications of this information is staggering. Sequenced genomes provide us with a genetic "parts" list; the next challenge is to explore how these parts come together to form a functioning organism. Additionally, the MGP is developing new tools to study how groups of genes work together to produce specific products or determine particular behaviors. Other objectives are to mine genomic information from sequenced microbes, improve tools for annotation and analysis of sequence data, develop high-throughput methods for determining gene function and gene expression, and develop methods for examining protein-protein and protein-nucleic acid interactions. Figure 1.1 shows the rapid increase in the number of sequenced microbial genomes.

Figure 1.1: Number of prokaryotic, archaeal and eukaryotic genomes sequenced since 1995

The future promises many exciting developments as the fruits of the MGP mature. Already, we have become more appreciative of the extent of the microbial world's effect on earth, realizing how little we know about this kingdom and wondering at its potential benefits to our world.

COMPARATIVE GENOMICS

Comparative genomics is the study of the differences and similarities in genome structure and organization in different organisms. For example, how are the differences between different bacteria reflected in their genomes? How similar are the number and types of proteins in the different bacteria sequenced to date? Essentially, comparative genomics is no more than the application of bioinformatics methods to the analysis of whole-genome sequences with the objective of identifying biological principles, i.e. biology in silico.

There are two drivers for comparative genomics. One is a desire to have a much more detailed understanding of the process of evolution at the gross level (the origin of the major taxonomic classes of prokaryotes) and at a local level (what makes related species unique). The second driver is the need to translate DNA sequence data into proteins of known function. The rationale here is that DNA sequences encoding important cellular functions are more likely to be conserved between species than sequences encoding dispensable functions or non-coding sequences.

For evolutionists the revolution in DNA technology has been a major advance. The reason is that the very nature of DNA allows it to be used as a "document" of evolutionary history: comparisons of the DNA sequences of various genes between different organisms can tell us a lot about relationships between organisms that cannot be correctly inferred from morphology. One definite problem is that the DNA itself is a scattered and fragmentary "document" of history, and we have to beware of the effects of changes in the genome that can bias our picture of organismal evolution.

The large number of diverse microbial genomes sequenced recently makes it possible, for the first time, to analyze evolutionary changes at the whole-genome level rather than at the single-gene level. Intraspecies and interspecies comparisons of the sequenced genomes demonstrate that an organism's complexity does not directly correlate with its number of genes, and suggest that combinatorial interactions in cells and organisms are a major contributor to the complexity of living systems. These comparisons have revealed conserved and variable elements of the genomes and suggest that tens of thousands of proteins are made of just about 1,500-2,000 discrete structural protein units called domains or modules. Different modular proteins are formed from these modules taken in different combinations, and this shuffling might play an extremely important role in the genesis of evolutionary novelties. New domain architectures (defined as the linear arrangement of domains within a polypeptide) have emerged in evolution by shuffling, adding or deleting domains, resulting in new proteins composed of old parts. More complex organisms seem to contain a greater variety of protein architectures than simpler ones. Whole-genome analyses developed in parallel and interdependently with new concepts of evolution such as evolutionary developmental biology ("Evo-Devo"), which aims to explain how developmental processes and mechanisms become modified in evolution, and how these modifications produce changes in animal morphology and body plans. Among these new concepts are such fruitful notions as (i) a universal principle of modular organization at various levels of living systems, with particular modules being changed and co-opted into new functions without affecting other modules; (ii) a concept of network-like organization of cellular regulatory systems, with cis-regulatory elements of the genome functioning as major nodes of the networks, and a crucial evolutionary role for changes in the regulatory systems; (iii) an assumption that the functional load per regulatory gene increases with the complexity of the organism; (iv) an idea of evolvability as a universal feature of living entities; and (v) the very important concept that not only natural selection but also internal developmental biases can form the basis for evolutionary changes.

As of February 2006, the website of the National Center for Biotechnology Information (NCBI) listed 293 bacterial genomes (25 Archaea and 268 Eubacteria) that have been completely sequenced (ftp://ncbi.nlm.nih.gov/genbank/genomes/bacteria); the details are given in Appendix-I. Simple analysis of the sequence data reveals two features of note. First, the genome sizes vary from 0.49 Mb (Nanoarchaeum equitans) to 9.12 Mb (Streptomyces avermitilis MA-4680), i.e. a more than 18-fold difference (Figure 1.2 and Figure 1.3). Secondly, the gene density is generally similar across all species and is about 1 gene per kilobase of DNA. A correlation of 0.898 is observed between genome size and number of genes (Figure 1.2). A similar correlation of 0.898 is also observed between genome size and number of proteins (Figure 1.3). This means that large prokaryotic genomes usually contain many more genes than smaller ones. By contrast, the human genome contains only twice as many genes as Drosophila.
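The two quantities discussed above, the fold difference in genome size and the correlation between genome size and gene count, can be sketched in a few lines of Python. The two genome sizes are the ones quoted in the text; the five (size, gene count) pairs are hypothetical values chosen only to exercise the function and are not data from Appendix-I, so the resulting coefficient is not the 0.898 reported here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Genome sizes quoted in the text, in megabases (Mb).
smallest_mb = 0.49   # Nanoarchaeum equitans
largest_mb = 9.12    # Streptomyces avermitilis MA-4680
fold_difference = largest_mb / smallest_mb

# Hypothetical genomes: sizes in Mb and annotated gene counts (illustrative only).
sizes_mb = [0.6, 1.1, 2.0, 4.6, 6.3]
gene_counts = [500, 900, 1900, 4300, 5600]

r = pearson(sizes_mb, gene_counts)

# Gene density in genes per kilobase (1 Mb = 1000 kb).
genes_per_kb = [genes / (size * 1000) for size, genes in zip(sizes_mb, gene_counts)]
```

With roughly one gene per kilobase, gene count grows almost linearly with genome size, which is why the correlation between the two is high for prokaryotes.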

So how can we account for the size diversity of prokaryotes? When the different genomes are arranged in size order, some interesting features emerge. First, the archaebacteria exhibit a very much smaller range of genome sizes. This could be an artifact of the small number of genomes examined, but more probably reflects the fact that most of them occupy a specialized environment and have little need for metabolic diversity. The exception is Methanosarcina acetivorans; this organism is known to thrive in a broad range of environments and, at 5.75 Mb, has the largest archaeal genome completely sequenced to date (Galagan et al., 2002). Second, the smallest eubacterial genomes are found in organisms that are normally associated with animals or humans, e.g. mycoplasmas, rickettsias and chlamydiae. Organisms that can occupy a greater number of niches have a larger genome size. Not surprisingly, there is a good correlation between genome size and metabolic and functional diversity, as demonstrated by the size of the genomes of Bacillus and Streptomyces (formation of spores, antibiotic synthesis), rhizobia (symbiotic nitrogen fixation), and Pseudomonas (degradation of a wide range of aromatic compounds).


Figure 1.2: A plot of genome size versus number of proteins

Figure 1.3: A plot of genome size versus number of genes


1.7 GENOME DATA FORMAT

1.7.1 Introduction

Nucleic acid sequences provide the fundamental starting point for describing and understanding the structure, function, and development of genetically diverse organisms.

GenBank is the National Institutes of Health's genetic sequence database and is an annotated collection of all publicly available nucleotide and protein sequences.

1.7.2 Overview of the Feature Table format

The overall goal of the feature table design is to provide an extensive vocabulary for describing features in a flexible framework for manipulating them. The Feature Table documentation represents the shared rules that allow the three collaborating databases (DDBJ, EMBL and GenBank) to exchange data on a daily basis. The range of features to be represented is diverse, including regions which:

• perform a biological function,

• affect or are the result of the expression of a biological function,

• interact with other molecules,

• affect replication of a sequence,

• affect or are the result of recombination of different sequences,

• are a recognizable repeated unit,

• have secondary or tertiary structure,

• exhibit variation, or have been revised or corrected


1.7.3 Format Design

The format design is based on a tabular approach and consists of the following items:

• Feature key - a single word or abbreviation indicating functional group

• Location - instructions for finding the feature

• Qualifiers - auxiliary information about a feature

1.7.4 Key aspects of the feature table design

• Feature keys allow specific annotation of important sequence features.

• Related features can be easily specified and retrieved.

• Feature keys are arranged hierarchically, allowing complex and compound features to be expressed. Both location operators and the feature keys show feature relationships even when the features are not contiguous. The hierarchy of feature keys allows broad categories of biological functionality, such as rRNAs, to be easily retrieved.

• Generic feature keys provide a means for entering new or undefined features. A number of "generic" or miscellaneous feature keys have been added to permit annotation of features that cannot be adequately described by existing feature keys. These generic feature keys will serve as an intermediate step in the identification and addition of new feature keys. The syntax has been designed to allow the addition of new feature keys as they are required.

• More complex locations (fuzzy and alternate ends, for example) can be specified. Each end point of a feature may be specified as a single point, an alternate set of possible end points, a base number beyond which the end point lies, or a region which contains the end point.

• Features can be combined and manipulated in many different ways. The location field can contain operators or functional descriptors specifying what must be done to the sequence to reproduce the feature. For example, a series of exons may be "join"ed into a full coding sequence.

• Standardized qualifiers provide precision and parsability of descriptive details. A combination of standardized qualifiers and their controlled-vocabulary values enables free-text descriptions to be avoided.

• The nature of supporting evidence for a feature can be explicitly indicated. Features for which there is no direct experimental evidence, such as open reading frames or sequences showing similarity to consensus sequences, can be annotated. Therefore, the feature table can incorporate contributions from researchers doing computational analysis of the sequence databases. However, all features that are supported by experimental data will be clearly marked as such.

• The table syntax has been designed to be machine parsable. A consistent syntax allows machine extraction and manipulation of sequences coding for all features in the table.

1.7.5 Feature Table Terminology

The format and wording in the feature table use common biological research terminology whenever possible. For example, an item in the feature table such as:

    Key             Location/Qualifiers
    CDS             23..400
                    /product="alcohol dehydrogenase"
                    /gene="adhI"

might be read as:

The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called 'alcohol dehydrogenase' and is coded for by a gene called "adhI".
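To make the three-part layout (feature key, location, qualifiers) concrete, the following Python sketch parses the alcohol dehydrogenase entry into a small record. It assumes the location is written in the `start..end` span notation of the feature table format; the `Feature` class and `parse_feature` function are hypothetical names introduced for illustration, and the parser handles only this simple layout, not the full feature table syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One feature table entry: key, raw location string, and qualifiers."""
    key: str
    location: str
    qualifiers: dict = field(default_factory=dict)


def parse_feature(lines):
    """Parse one entry: the first line is 'KEY LOCATION', later lines are '/name="value"'."""
    key, location = lines[0].split(None, 1)
    feature = Feature(key=key, location=location.strip())
    for line in lines[1:]:
        line = line.strip()
        if not line.startswith("/"):
            continue  # ignore anything that is not a qualifier line
        name, _, value = line[1:].partition("=")
        feature.qualifiers[name] = value.strip('"')
    return feature


entry = [
    'CDS             23..400',
    '                /product="alcohol dehydrogenase"',
    '                /gene="adhI"',
]
feature = parse_feature(entry)

# A simple span location splits into its two end points.
start, end = (int(part) for part in feature.location.split(".."))
```

Here `feature.qualifiers` maps `product` to `alcohol dehydrogenase` and `gene` to `adhI`, mirroring the reading given above.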

A more complex description:

    Key             Location/Qualifiers
    CDS             join(544..589,688..>1032)
                    /product="T-cell receptor beta-chain"

which might be read as:

This feature, which is a partial coding sequence, is formed by joining the elements indicated to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
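The joined, partial location can be unpacked in the same spirit. The sketch below, again in Python and with a hypothetical function name, accepts either a single `start..end` span or a `join(...)` of such spans, and treats a `<` or `>` in front of a coordinate as the marker of a partial feature. It is a simplified reading that assumes the location is written as `join(544..589,688..>1032)`; other operators and location forms of the feature table format are not handled.

```python
import re

# One span such as "544..589" or "688..>1032"; "<" or ">" marks an end point
# that lies beyond the stated base (a partial feature).
SPAN = re.compile(r"\s*([<>]?)(\d+)\.\.([<>]?)(\d+)\s*")


def parse_location(location):
    """Return (spans, partial) for a simple span or a join(...) of spans."""
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        parts = text[len("join("):-1].split(",")
    else:
        parts = [text]

    spans = []
    partial = False
    for part in parts:
        match = SPAN.fullmatch(part)
        if match is None:
            raise ValueError(f"unsupported location element: {part!r}")
        start_flag, start, end_flag, end = match.groups()
        partial = partial or bool(start_flag or end_flag)
        spans.append((int(start), int(end)))
    return spans, partial


spans, partial = parse_location("join(544..589,688..>1032)")

# Bases covered by the stated coordinates; for a partial feature this is only
# a lower bound on the true length of the coding sequence.
covered = sum(end - start + 1 for start, end in spans)
```

For this entry the two elements are the spans 544 to 589 and 688 to 1032, and the `>` flags the feature as partial, which matches the reading given above.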

One of the insights to come from bacterial sequencing and comparative genomics is that the bacterial genome is a dynamic entity shaped by multiple forces, including genome reduction, genome rearrangement, gene duplication, gene loss, and gene acquisition by lateral gene transfer. An interesting group that has been targeted for genome analysis comprises species that are no longer free-living and, because of genome reduction, are now dependent on their hosts for survival. Throughout the history of life, these obligate intracellular bacteria have acted as major evolutionary catalysts, being involved in the origin of organelles and the diversification of eukaryotes. Present-day intracellular associations include a range of parasites, mutualists and commensal symbionts that play important roles in the ecology and physiology of their hosts. Owing to their medical and ecological importance, intracellular bacteria have been targets of numerous genome sequencing projects that have provided insights into the consequences of this specialized lifestyle. We have learned that, typically, these species have drastically reduced genomes that encode a streamlined metabolism, show rapid DNA sequence evolution and strong nucleotide compositional biases, and exhibit lower levels of genome flux (i.e. gene acquisition from foreign sources, and intragenomic changes such as inversions and translocations). The integration of population genetic processes with knowledge of bacterial physiology and ecology has helped to clarify mechanisms that might explain these common features. In particular, current views of 'reductive evolution' emphasize that fundamental evolutionary processes (natural selection, mutation and genetic drift) might affect intracellular species differently than they do free-living ones (Andersson et al., 1998 and Moran, 1996). For instance, genome streamlining might reflect relaxed purifying selection on metabolic functions that are dispensable in a resource-rich intracellular niche. In addition, strong effects of nucleotide mutations in intracellular bacteria might elevate rates of gene disruption, followed by erosion owing to a deletion bias in bacteria (Mira et al., 2001). This is thought to reflect their generally low levels of repeats and mobile DNA, reduced recombination functions, and limited opportunities for DNA exchange among sequestered species (Frank et al., 2002).

However, few computational analyses have treated obligatory intracellular parasites or reduced genomes as a group. Such analyses would help to establish whether any generalizations about the lifestyle of these prokaryotes distinguish them from free-living bacteria. When the author started the work, five obligatory intracellular parasites (compared to twenty-one as of February 2006) and four Mycoplasma species (compared to over ten as of February 2006) had been completely sequenced. There were reports on the evolution of these genomes by genome reduction from larger genomes. However, no data were available on the comparative analysis of these genomes. Thus, we embarked on a project to study the minimal genomes as a group, with specific reference to gene order, protein length profiles, backbone genome, overlapping genes and gene fusion.

1) The thesis starts with a description of the tools (GOV and PPD) for identifying commonalities among reduced genomes (Chapter 2 and Chapter 3).

2) Obligatory intracellular parasites have reduced genomes. The use of the derived information on differential gene loss, with emphasis on which genes are lost and which genes compose the backbone genome, is elaborated in Chapter 4. It was interesting to find that some genes in the “backbone genome” have an overlapping gene architecture.

3) Miyata and Yasunaga (1978) reported that rates of evolution are slower in overlapping genes. Concurrently, we performed an analysis of overlapping genes in two reduced genomes and compared the results with E. coli. These results are elaborated in Chapter 5. It was interesting to see that a substantial proportion of overlapping genes have important cellular functions, for example ribosomal proteins. Ribosomal proteins are drug targets in various bacterial infections.

4) Comparison of closely related obligatory bacteria revealed that some overlapping genes in one species are fused into a single gene in another bacterial species. Although some of these genes, by description, appeared to have important cellular functions, a significant proportion were hypothetical, and there was no way to verify the essentiality of these genes. Recently, Salama et al. (2005) published transposon mutagenesis data on H. pylori. The role of overlapping genes and juxtaposed genes in gene fusion, and the identification of fusion genes as putative drug targets, are described in Chapter 6.

5) A novel computational genomics approach based on subtractive genomics is proposed in Chapter 7 for the identification of putative microbial drug targets. This approach involves three steps: the identification of essential genes in the microbial genome; the filtering of these genes against the human genome to identify genes that share no homology with any human gene; and, lastly, the validation of these genes against transposon mutagenesis data for their essentiality. This analysis was performed on the P. aeruginosa and Salmonella species genomes because transposon mutagenesis data are not available for any of the twenty-two obligatory genomes sequenced (as of February 2006).
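The three steps listed in item 5 amount to a sequence of set operations, which the following minimal Python sketch illustrates. It is not the pipeline used in Chapter 7: the function name and the gene identifiers are hypothetical, and the homology search against the human genome is assumed to have been run beforehand, so that only its result (the list of genes with a human homolog) is passed in.

```python
def subtractive_targets(essential, human_homologs, mutagenesis_essential):
    """Return candidate drug targets following the three steps described above."""
    # Step 1: start from the genes predicted to be essential in the microbe.
    candidates = set(essential)
    # Step 2: discard genes that share homology with any human gene.
    candidates -= set(human_homologs)
    # Step 3: keep only genes also reported as essential by transposon mutagenesis.
    return sorted(candidates & set(mutagenesis_essential))


# Hypothetical gene identifiers, for illustration only.
essential = ["geneA", "geneB", "geneC", "geneD"]
human_homologs = ["geneB"]
mutagenesis_essential = ["geneA", "geneC", "geneE"]

targets = subtractive_targets(essential, human_homologs, mutagenesis_essential)
print(targets)  # ['geneA', 'geneC']
```

Here geneB is dropped at the filtering step and geneD at the validation step, which leaves geneA and geneC as putative targets.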

CHAPTER 2 GENE ORDER VISUALISER – GOV

GOV is a web-based interactive computational tool for visualization and comparison of gene order from prokaryotic and selected viral genome data. It can reveal many intriguing similarities and differences in gene order of multiple genomes by comparison. The interface facilitates easy extraction of the nucleotide sequence of the gene of interest and BLAST analysis against GenBank at NCBI to provide insights into gene functions and orthologs of the gene in other species.

The past decade has seen sequence analysis methods extend from single-gene analyses to the simultaneous analysis of multiple genes and proteins. Consequently, there is a need for software tools that allow effective mining of these enormous datasets at the genome level. A key challenge is to make them user-friendly, available to a larger community and easy to integrate with public domain software.

2.3 BACKGROUND

Advances in sequencing technology have created opportunities for performing large-scale genome comparisons. An increasingly large number of prokaryotic and viral genomes are becoming available to the scientific community, as is evident from the exponential increase in the size of GenBank and other genome databases (Benson et al., 2003). To date, about 300 bacterial and numerous viral genomes have been completely sequenced. These developments leave comparative genomics poised to give a better understanding of biological functions and to reveal higher-order functional meaning from these data. Parallel analysis of a number of diverse or related genomes can contribute to our understanding of their functional subsystems and the evolutionary forces shaping genome architecture.

Presently, there are many systems focusing on specific types of comparisons for many genomes, e.g. COG (Natale et al., 2000), PEDANT (Frishman et al., 2003), KEGG (Kanehisa, 2002) and WIT (Overbeek et al., 2000). As more and more genomes are sequenced, gene order comparison is emerging as an informative property of genomes, giving information not only on genome architecture but also on the functions and interactions of the proteins they encode, as well as on genome and organismal evolution (Tamames, 2001). Gene order is thus becoming an increasingly popular paradigm in molecular biology. Gene order studies have been carried out for a few bacterial species and eukaryotic organisms (Mushegian and Koonin, 1996; Gilley and Fried, 1999 and Subramanian et al., 2000). Gene order studies on fully sequenced viruses, mitochondria and chloroplasts have also been performed (Hannenhalli et al., 1995; Boore and Brown, 1998; Blanchette et al., 1999; Turmel et al., 1999 and Afonso et al., 2000). Recently, Pal and Hurst reported a higher tendency of essential genes to remain adjacent during evolution (Pal and Hurst, 2003). Mazumdar and colleagues developed a web-based interface for gene order comparison of small genomes under 200 kb in size; its output takes the form of a table, and it allows comparison of only two genomes at a time (Mazumdar et al., 2001).

We present here a new tool, Gene Order Visualiser (GOV), that compares completely sequenced prokaryotic genomes and coronavirus genomes through multiple gene order comparison or functional description cluster criteria. GOV visually portrays gene order information in a form that allows one complete genome to be compared simultaneously with other complete prokaryotic and selected viral genomes. In addition, the sequence of a gene of interest can be subjected to BLAST analysis at NCBI. GOV analyses and visualization may shed light on the selective pressures governing the clustering of genes and the conservation of gene order in specific genomes, besides helping in gene annotation. The system concentrates on two types of comparison: (i) gene order between genomes and (ii) functional category level comparison based on the product descriptions in GenBank. For gene order, we have developed a color code module that annotates genes based on the functional descriptions they share and compares the resulting clusters side by side.
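The two comparison types described above can be illustrated with a minimal Python sketch. It is not GOV's implementation: the keyword-to-color table, the function names and the gene identifiers are all hypothetical, and gene order is treated as linear, so the wrap-around pair of a circular chromosome is ignored.

```python
# Hypothetical keyword-to-color table; the color scheme GOV uses is not given here.
CATEGORY_COLORS = {
    "ribosomal": "red",
    "polymerase": "blue",
    "transport": "green",
}


def color_for(description):
    """Pick a color from the first keyword found in a product description."""
    text = description.lower()
    for keyword, color in CATEGORY_COLORS.items():
        if keyword in text:
            return color
    return "grey"  # no keyword matched, e.g. hypothetical proteins


def conserved_adjacent_pairs(order_a, order_b):
    """Return neighbouring gene pairs of order_a that are also neighbours in order_b."""
    def pairs(order):
        # frozenset ignores orientation, so (x, y) and (y, x) count as the same pair.
        return {frozenset(p) for p in zip(order, order[1:])}

    shared = pairs(order_a) & pairs(order_b)
    return [(x, y) for x, y in zip(order_a, order_a[1:]) if frozenset((x, y)) in shared]


print(color_for("30S ribosomal protein S10"))
print(conserved_adjacent_pairs(["g1", "g2", "g3", "g4"], ["g2", "g3", "g1", "g4"]))
```

In this toy input only the pair (g2, g3) remains adjacent in both genomes. Matching on keywords in the product description is one simple way to group genes by shared functional description; any real annotation scheme would need a curated vocabulary.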

2.4 METHODOLOGY

GOV is a web-based application that primarily uses the CDS information in the FEATURE TABLE and the nucleotide sequence of a GenBank genome entry to represent the genome architecture graphically. The interface retrieves and displays information on gene organization, gene length and gene description, based on functional category, in a user-friendly manner. Hyperlinks from the gene structures allow one to view the relative positions and order of genes, and to extract the sequences and display them in a separate window. This accelerates the ability of scientists to access information about genes, gene clusters and transcription direction directly from GenBank genome data. GOV is a general-purpose tool of potential use to anyone studying gene order and comparative genomics in prokaryotes or the recently detected coronavirus genomes. Development of this tool involved compiling, calculating and standardizing gene names and gene order information from GenBank.
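As an illustration of how gene order, gene length and transcription direction can be read from the CDS entries of a feature table, the following Python sketch parses a small excerpt in GenBank flat-file layout. It is a simplified example and not the parser used by GOV: the sample records and all names are hypothetical, only simple `start..end` and `complement(start..end)` locations are handled (`join(...)` locations are skipped), and qualifiers are assumed to fit on one line.

```python
import re

# Hypothetical feature-table excerpt in GenBank flat-file layout.
SAMPLE = '''\
     CDS             190..255
                     /gene="geneA"
                     /product="hypothetical protein"
     CDS             complement(300..650)
                     /gene="geneB"
                     /product="30S ribosomal protein S10"
'''

# Feature keys start in column 6; qualifiers are indented further.
CDS_LINE = re.compile(r"^ {5}CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?\s*$")
FEATURE_KEY = re.compile(r"^ {5}\S")
QUALIFIER = re.compile(r'^\s+/(gene|product)="([^"]*)"\s*$')


def parse_cds(feature_text):
    """Collect start, end, strand, length, gene and product for each simple CDS."""
    genes = []
    current = None
    for line in feature_text.splitlines():
        match = CDS_LINE.match(line)
        if match:
            start, end = int(match.group(2)), int(match.group(3))
            current = {
                "start": start,
                "end": end,
                "strand": "-" if match.group(1) else "+",  # transcription direction
                "length": end - start + 1,
            }
            genes.append(current)
            continue
        if FEATURE_KEY.match(line):
            current = None  # another feature, or a location this sketch does not handle
            continue
        qualifier = QUALIFIER.match(line)
        if qualifier and current is not None:
            current[qualifier.group(1)] = qualifier.group(2)
    # Gene order follows the start coordinate along the sequence.
    return sorted(genes, key=lambda gene: gene["start"])


for gene in parse_cds(SAMPLE):
    print(gene["gene"], gene["strand"], gene["length"], gene["product"])
```

The resulting list of records, ordered by start coordinate, is the kind of standardized gene order table from which a graphical representation could be drawn.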

Date posted: 15/09/2015, 17:11
