Biological Resources and Databases

Part of the document Designing Bioinformatics Algorithms in Python (pages 79 - 85)

In Section 3.4, we saw that molecular biology advanced rapidly in the last quarter of the previous century. It certainly benefited from the development and spread of the Internet, which occurred during the same period. In those early days, the amount of sequence information was very limited, and it was shared in printed pages or in text files. The development of new sequencing techniques led to an exponential growth of sequence data, which prompted the need for efficient, scalable and consistent ways of transferring and sharing the data.

One of the first efforts to collect sequence data was made by Margaret O. Dayhoff, who compiled the first comprehensive collection of protein sequences, published from 1965 to 1978. This gave rise to the Protein Information Resource (PIR, pir.georgetown.edu/), created in 1984.

The European Molecular Biology Laboratory (EMBL)-Bank was created in 1982 as the first international database for nucleotide sequences. The same year saw the public release of GenBank (www.ncbi.nlm.nih.gov/genbank/), now maintained by the National Center for Biotechnology Information (NCBI) in the United States, which contains an annotated collection of publicly available nucleotide sequences and their respective translations. In 1986, the Swiss-Prot database (now part of UniProtKB) was presented, containing non-redundant, curated protein sequence data complemented with other high-level information and interconnected with other sequence resources.

In 1988, the International Nucleotide Sequence Database Collaboration (INSDC) was launched, a joint effort of EMBL-EBI in Europe, NCBI in the United States and DDBJ (www.ddbj.nig.ac.jp/) in Japan to collect and disseminate nucleotide sequences. It currently involves the databases of the DNA Data Bank of Japan, GenBank and the European Nucleotide Archive (ENA, www.ebi.ac.uk/ena).

During the years of the Human Genome Project, independent on-line browsers were created to share and provide a graphical display of the sequence assembly of the human genome.

The Ensembl genome database (www.ensembl.org/), a joint initiative of the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, and the genome browser at the University of California Santa Cruz (UCSC, genome.ucsc.edu/) are the best-known browsers for genomic information. These browsers have evolved into very complete resources that now integrate many types of genomic information from several species.

In 2002, the UniProt (www.uniprot.org/) consortium was created, joining the protein sequence databases of EBI, Swiss-Prot and PIR in a collection of curated and non-curated sequences with a high level of annotation. The massive growth of sequence data in many different formats, including data from high-throughput sequencing, led to the replacement of EMBL-Bank with the European Nucleotide Archive, which integrates nucleotide sequences, associated data and the respective experiment annotations.

Sequence data is being generated in a multitude of laboratories worldwide. Most current sequence databases allow scientists to submit the data they generate. While this allows quicker and wider dissemination of the data, it also creates some problems in terms of consistency, completeness and redundancy of the submitted data.

The reader should be aware that biological sequence databases can be divided into primary and secondary databases. Primary databases contain sequence data submitted by researchers; this data may not be fully processed and is not curated. Redundancy often exists in the form of multiple entries for the sequence of a gene or transcript, submitted by different people working on related topics. Another critical issue is that, since data is shared among different databases, an error or inaccuracy in a primary database may easily propagate to other databases that use data from these sources. Examples of primary databases include ENA, GenBank and DDBJ.

Secondary databases contain data that may be obtained from primary databases and that have been curated by specialists for consistency and completeness. The data entries are complemented and annotated with metadata and additional information. Examples include the NCBI RefSeq (www.ncbi.nlm.nih.gov/refseq/) database, which is curated from GenBank, and UniProtKB/Swiss-Prot (www.uniprot.org/uniprot/).

We now describe some of the databases that are relevant in the context of this book and that readers can access to obtain data for their own experiments.

ENA (includes EMBL-Bank) – www.ebi.ac.uk/ena

GenBank – www.ncbi.nlm.nih.gov/GenBank

DDBJ – www.ddbj.nig.ac.jp

These are the three main primary databases of publicly available nucleotide sequences. They form the INSDC consortium and share data among themselves, being periodically updated. Each database has its own data format.

NCBI Gene – http://www.ncbi.nlm.nih.gov/gene

This is a gene-centric database containing data from multiple species. Beyond the sequences, it integrates aspects such as genotypic variation of the gene, associated phenotypes and the molecular pathways in which the gene is involved.

NCBI RefSeq – http://www.ncbi.nlm.nih.gov/refseq

This is a secondary database that processes data from the primary database GenBank to deliver a comprehensive database of curated information integrating data from the genome, transcriptome and proteome. The RefSeq dataset is an important reference for genome annotation and characterization studies, and is widely used in species comparison and gene expression analysis.

Gencode – www.gencodegenes.org

The Gencode annotation started as an effort within the ENCODE project [27, 49, 51] to provide a fully integrated annotation of the human genome. It provides a comprehensive annotation of the human gene set that has been used by ENCODE and many other projects. It contains information not only on protein-coding genes, but also on many other RNA types. Currently, it is expanding to include the annotation of the mouse genome.

UCSC Genome Browser – https://genome.ucsc.edu/

Ensembl – http://www.ensembl.org/

These are two genome browsers that allow online viewing and download of genomic data from multiple species. By zooming in and out on the genome, we can visualize the genomic sequence, the gene annotation, the conservation of the sequence across species, and regulatory and disease data. These browsers also allow researchers to upload their own data for visualization within the selected genomic context, and to download full datasets of sequence and annotation data that have been previously processed and are easy to use.

NCBI Protein – http://www.ncbi.nlm.nih.gov/protein

This database aggregates protein data from multiple sources, ranging from sequence translations of genes in GenBank or RefSeq to curated entries retrieved from Swiss-Prot, PIR or PDB.

UniProt – http://www.uniprot.org/

UniProt is an integrated and comprehensive repository of protein sequence and functional information. It gathers data from multiple other databases, including Swiss-Prot, TrEMBL and PIR-PSD. Within UniProt we can find three databases: UniParc, UniProtKB and UniRef. UniParc is a non-redundant and comprehensive database of publicly available protein sequences. UniRef provides clustered sets of protein sequences from UniProtKB and UniParc. UniProtKB, the most important resource, provides functional information on the proteins and comprises a curated component (Swiss-Prot) and a non-curated component (TrEMBL).

Protein Data Bank (PDB) – http://www.rcsb.org/

PDB is a database that contains structural data of proteins, nucleic acids and other complex assemblies. It provides functionalities for data deposit and download, and tools for data visualization in multiple data formats. Information is organized by protein and includes the sequence, annotations of secondary structure, three-dimensional coordinates and views, similarity search at the sequence and structure levels, and the details of the experiment.

dbSNP – www.ncbi.nlm.nih.gov/snp

dbVar – www.ncbi.nlm.nih.gov/dbvar

These are two NCBI databases that contain the annotation of short genetic variations and large structural variations, respectively, within the genomes of human and other species. dbSNP focuses mostly on point mutations, microsatellites, and small insertions and deletions. It contains information on the mutated alleles, their sequence context, a visualization of their occurrence within the gene sequence and their frequency in populations, and it connects with other databases to show information on clinical significance. dbVar entries contain a view and details of the genomic region where the variation occurs, complemented with experimental evidence and validation, the publication where it was first reported, and clinical associations.

ClinVar – www.ncbi.nlm.nih.gov/clinvar/

ClinVar is a database that provides information and supporting evidence on the association between human genetic variation and phenotypes. It is particularly useful in the clinical and health context, since it reports variants found in patient samples along with assertions about the clinical relevance of these variants, made by the researchers or clinicians who submitted the data.

Gene Expression Omnibus (GEO) – www.ncbi.nlm.nih.gov/geo/

GEO is an NCBI database that collects gene expression datasets obtained with either microarray or sequencing technologies. The database is organized into datasets that may contain multiple samples. Each entry reports a reference to the platform on which the data was generated, the raw and the processed data, and the article in which the data was published. Datasets can be searched by keyword.

PubMed – www.ncbi.nlm.nih.gov/pubmed/

PubMed is an NCBI database that indexes scientific articles related to biomedical and life sciences research. It currently contains pointers to more than 27 million articles and books. Article entries contain links to the publisher's website. Advanced article searches can be made by keyword, title or author name.

The exponential growth of sequence data prompted multiple efforts to catalogue, share and disseminate this data through online databases. Many databases have been proposed and are currently available, containing comprehensive repositories of nucleotide and protein sequences, often complemented and integrated with additional information from other databases. Many other databases have been developed to integrate and complement this genetic information.

Bibliographic References and Further Reading

This chapter takes a very basic approach to some major concepts in cellular and molecular biology. Many textbooks cover these concepts, and many others, in greater depth [10, 37, 120].

Exercises

1. Sort the following cellular elements by increasing level of structural organization:

cell, nucleotide, chromosome, gene, DNA.

2. Consider a word of length 5. Indicate the number of possible word combinations:

a. based on the DNA alphabet?

b. based on the RNA alphabet?

c. based on the protein alphabet?
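Answers to counting questions of this kind can be checked with a few lines of Python. The sketch below is only an illustration: the function names `count_words` and `enumerate_words` and the alphabet constants are chosen here and do not come from the text. It relies on the fact that there are k^n distinct words of length n over an alphabet of k symbols.

```python
from itertools import product

# Alphabets written as strings of one-letter symbols (assumed representation).
DNA = "ACGT"
RNA = "ACGU"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def count_words(alphabet, length):
    """Return the number of distinct words of the given length: k ** n."""
    return len(set(alphabet)) ** length


def enumerate_words(alphabet, length):
    """Yield every word explicitly; only practical for small cases."""
    for letters in product(alphabet, repeat=length):
        yield "".join(letters)
```

For instance, `count_words(DNA, 3)` counts the possible nucleotide triplets, and `enumerate_words(DNA, 3)` lists them one by one, which is a useful sanity check for small lengths.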

3. Consider a DNA sequence with 12 nucleotides:

a. Indicate the number of codons that can be derived from a direct reading of this sequence.

b. Consider that, in the beginning of the sequence, we find the start codon and in the end one of the stop codons. What is the total number of codons that can be derived from this reading?

c. Indicate the total number of possible readings, i.e. all the reading frames.
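To experiment with codons and reading frames, the following minimal sketch may help. It assumes a DNA sequence given as a string over A, C, G and T, and it counts reading frames on both strands (three offsets on the given strand and three on its reverse complement). The function names are illustrative and are not taken from the book's code.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def codons(dna, offset=0):
    """Split a sequence into complete codons, starting at the given offset."""
    seq = dna.upper()
    # Stop at len(seq) - 2 so that an incomplete trailing codon is dropped.
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def reading_frames(dna):
    """Return the codon lists of all six reading frames."""
    rc = reverse_complement(dna)
    forward = [codons(dna, offset) for offset in range(3)]
    backward = [codons(rc, offset) for offset in range(3)]
    return forward + backward
```

Calling `reading_frames` on a short toy sequence such as `"ATGGCCTAA"` shows how shifting the offset changes both the codons and how many complete codons fit in the sequence.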

4. The following sequence represents part of a DNA sequence in which exons are shown in upper case and introns in lower case.

> Exon-Intron-Exon sequence
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTC
CTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCAC
GCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGgtgaggctccctcccctg
ctccgacccgtgctcctcgcccgcccggacccacaggccaccctcaaccg
tcctggccccggacccaaaccccacccctcactctgcttctccccgcagG
ATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCT
GAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACG
CGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCC
GCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTT
CAAG

• Based on the previous sequence indicate the RNA sequence that will be obtained after mRNA splicing.

• Discuss what would happen if a SNP (g→t) occurred at the boundary between the first exon and the intron. Propose a possible alternative mRNA sequence after splicing.
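With the upper-case/lower-case convention used in this exercise, splicing can be simulated by keeping only the upper-case characters. The sketch below is a simplified illustration under two assumptions: the sequence given is the coding strand, so the mRNA is obtained by replacing T with U, and case alone marks the exon boundaries. The function names and the toy sequence in the usage note are hypothetical.

```python
def splice(dna):
    """Keep exon bases (upper case) and drop intron bases (lower case)."""
    return "".join(base for base in dna if base.isupper())


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def spliced_mrna(dna):
    """Remove the introns, then transcribe the remaining exons."""
    return transcribe(splice(dna))
```

For the toy input `"ATGGCCgtaagtcccagGAATAA"`, `spliced_mrna` joins the two exons and returns an RNA string. Removing spaces and line breaks from the exercise sequence before calling the function is left to the reader.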

5. Suppose that you are studying a very short protein sequence. We would like to know the sequence of nucleotides that gave rise to this protein sequence: Met-Ala-His-Trp.

a. How many mRNA sequences can code this protein?

b. What is the length of the mRNA?

c. How many nucleotides from the mRNA are critical to code for this protein, i.e. how many of the nucleotides have a direct impact on the amino acid sequence?

Hint: i) Use the genetic code table provided in Table 3.1. ii) Remember that mRNAs coding for proteins contain a stop codon. iii) One amino acid can be coded by multiple codons.
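The reasoning suggested by the hint can be sketched in Python: the number of mRNAs coding for a peptide is the product of the number of codons available for each residue, multiplied by the number of stop codons when a stop is required. The table below assumes the standard genetic code (61 sense codons and 3 stop codons). The names `CODONS_PER_RESIDUE`, `count_coding_mrnas` and `mrna_length` are illustrative only.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code (assumption).
CODONS_PER_RESIDUE = {
    "Ala": 4, "Arg": 6, "Asn": 2, "Asp": 2, "Cys": 2,
    "Gln": 2, "Glu": 2, "Gly": 4, "His": 2, "Ile": 3,
    "Leu": 6, "Lys": 2, "Met": 1, "Phe": 2, "Pro": 4,
    "Ser": 6, "Thr": 4, "Trp": 1, "Tyr": 2, "Val": 4,
}
STOP_CODONS = 3  # UAA, UAG and UGA


def count_coding_mrnas(peptide, include_stop=True):
    """Count the mRNAs coding for a peptide written like 'Met-Lys-Leu'."""
    residues = peptide.split("-")
    total = prod(CODONS_PER_RESIDUE[residue] for residue in residues)
    return total * STOP_CODONS if include_stop else total


def mrna_length(peptide, include_stop=True):
    """Length in nucleotides of the coding region, optionally with the stop."""
    n_codons = len(peptide.split("-")) + (1 if include_stop else 0)
    return 3 * n_codons
```

As a small check on a different peptide, `count_coding_mrnas("Met-Lys-Leu")` multiplies 1, 2 and 6 by the three stop codons. The same call can be used with the peptide from the exercise.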

6. From NCBI Gene, retrieve the genomic sequence of the human TP53 gene. Obtain additional information on this gene by looking for the entry in the GenBank database. Retrieve the list of protein sequences derived from each of the isoforms of the gene.

Hint: FASTA is a format to organize biological sequences. Search for the entry that contains this file.
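Once a FASTA file has been downloaded, it can be read with a few lines of Python. The sketch below is a minimal parser under the usual convention that each record starts with a header line beginning with `>` followed by one or more sequence lines. The function name `parse_fasta` is chosen here, and a maintained parser is preferable for real work.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}
```

A file can be parsed with `parse_fasta(open(path).read())`, where `path` is a placeholder for the location of the downloaded file. Records with identical headers would overwrite each other in this simple version.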

7. From the NCBI RefSeq, find the coding sequence and the protein sequence for the TP53 gene in human, chimp and mouse.

Hint: Search for the gene name, select the gene entry corresponding to each of the species and use the “Send to” option to obtain the nucleotide or the protein file for the coding sequence.
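As an alternative to the web interface, NCBI records can also be requested programmatically through the Entrez E-utilities `efetch` endpoint. The sketch below only builds the request URL, so it runs without network access. The helper name `efetch_url` is hypothetical, and the accession `NM_000546` is used as an illustrative RefSeq identifier that should be confirmed on the RefSeq page before use.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL asking for one record as plain text."""
    params = {
        "db": db,            # e.g. "nuccore" for nucleotides, "protein" for proteins
        "id": accession,     # accession number of the record
        "rettype": rettype,  # e.g. "fasta" or "gb" (GenBank flat file)
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


# Example (not executed here, since it needs network access):
# from urllib.request import urlopen
# fasta_text = urlopen(efetch_url("NM_000546")).read().decode()
```

NCBI asks users of E-utilities to respect its usage guidelines, so requests of this kind should be kept infrequent.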

8. From PDB, retrieve the files with the primary structure and the three-dimensional structure of one of the versions of the human Transthyretin protein.

9. In the UCSC Genome Browser, visualize the available information for the MDM2 gene in human.

a. For one of the Gencode isoforms, obtain the genomic, mRNA and protein sequence.

(Hint: click on the transcript/isoform and go to the “Sequence and Links to Tools Database” section.)

b. Visualize the RNA-seq expression data from GTEx and find in which tissues this gene has higher expression.

c. Find all the alternate gene symbols for this gene.

10. In the Ensembl Genome Browser, visualize the available information for the MDM2 gene in human.

a. For the longest transcript, obtain the coding sequence (click on the gene structure of the longest transcript).

b. How many splice variants do you find for this gene? How many are protein coding?

(See transcript table.)

c. Compare the evolutionary tree of MDM2. (Use the Comparative Genomics/Gene Tree tool.)

d. Find a germline and a somatic SNP occurring within the coding sequence of the gene. Somatic variants are identified by a COSM prefix and germline variants by an rs prefix. (Use the Genetic Variation/Variant table tool.)

Basic Processing of Biological Sequences

In this chapter, we address the computational representation of biological sequences and cover basic algorithms for their processing. We will address the implementation of the processes related to gene expression, covering transcription, translation and the identification of open reading frames. We also cover the implementation of a class for biological sequences (including DNA, RNA and protein sequences). Finally, we will review a set of classes from the BioPython package to store and process sequences, together with their annotations, allowing their loading from databases and reading/writing from files in different formats.
