Understanding and Assembling 454 Genome & Transcriptome data

Assembly Training

May 2011

Stephen Bridgett

Aims

•  Why sequence transcriptomes?

•  How does 454 sequencing work?

•  What are ‘sff’ files?

•  Using sff tools

•  What is assembly?

•  Challenges to assembly

•  Newbler assembler and Output files

•  Exercises with sample data

Why sequence transcriptomes?

•  Gives a more dynamic view of the activity in a cell than genome sequencing would, as it:

•  Gives relative expression levels for different cells under different conditions

•  Can identify alternative splicing and fusion genes (important in several cancers)

•  Focuses on gene sequences, which are often the main research focus

How does 454 sequencing work?

[Figure: the 454 sequencer workflow: DNA capture bead, emPCR, pyrosequencing reaction, signal image, base calling]

Data obtained from 454 sequencing

•  Roche 454 ‘Titanium’ genomic reads are approximately 400 bases long

•  Transcriptome reads tend to be a bit shorter, e.g. 350 bases

•  Typically 700,000 reads are obtained from one sequencing plate

•  Plates can be divided into 2, 4, 8 or 16 lanes

•  Samples can have an MID (multiplex identifier) ‘barcode’ added, so several samples can be run together in the same lane

What are ‘sff’ files?

•  ‘Sff’ files are Roche’s “Standard Flowgram Format” files, containing the sequence data produced from a 454 run

•  The sff files contain:

•  a manifest header at the start describing the contents,

•  flow intensity signal values for each base in each read

•  They are in binary format, so need to be converted to a text format, such as a fasta file (using the ‘sffinfo’ program)

•  The Sequence Read Archive (SRA, at EBI or NCBI) requests that these sff files be uploaded, to obtain an accession number for publications

What is Assembly?

•  Merge the short reads into long contigs (ideally a full transcript) by finding the best sequence overlaps between reads

•  E.g. Roche’s Newbler assembler, the MIRA assembler, the TGICL assembler, Phrap, CAP3, the MOSAIK reference-guided assembler, etc.

•  Newbler is an ‘overlap’ assembler. There are also de Bruijn graph assemblers, designed to cope with the very large numbers of short reads from Illumina or SOLiD, such as Velvet, CLC Assembly Cell, Cortex, SOAPdenovo and ABySS

[Figure: reads overlapped to form a contig, viewed in the gsAssembler graphical interface]
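The overlap idea can be illustrated with a toy sketch in Python. This is not Newbler's algorithm: it only looks for the longest exact suffix-to-prefix overlap between two reads and merges them, whereas a real assembler tolerates sequencing errors and aligns many reads at once. The function names and the `min_overlap` parameter are hypothetical, chosen for illustration.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first, so the first match is the best one.
    for size in range(max_len, max(min_overlap, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads into one contig, or return None if they do not overlap enough."""
    size = longest_overlap(left, right, min_overlap)
    if size == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[size:]


contig = merge_reads("ACGTTGCA", "TGCAGGTC")
print(contig)
```

Here the two reads share the four bases TGCA, so they are joined into a single 12-base contig.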

Challenges for assembly (1)

•  Contaminants in samples (e.g. from bacteria or human)

•  Ribosomal RNA (small and large subunits)

•  PCR artifacts (e.g. chimeras and mutations)

•  Sequencing errors, such as “homopolymer” errors, which occur in runs of e.g. 3 or more of the same base

•  MIDs (multiplex identifiers) and primers/adapters (e.g. SMART adapters used to synthesise cDNA) still in the raw reads

•  Repeats and large or polyploid genomes: repeated sequences in the transcriptome make assembly more difficult

Challenges for assembly (2)

•  Extra sample preparation steps in cDNA synthesis: more risk of cloning errors or contamination, and a wider range of read lengths

•  Large range of expression levels (e.g. 10^5): some transcripts have low read coverage and some very high coverage

•  Alternative splicing: differing reads from the same part of the genome

•  Roche’s Newbler 2.3 assembler sometimes didn’t finish a transcriptome assembly (it seemed to get lost when “Detangling Alignments”), but the latest Newbler 2.5 beta is able to

Blast search to check for contaminants

•  Blastx search of 5,000 randomly picked reads against the UniRef90 or non-redundant dataset

•  Sorted by frequency of description (or taxonomy), keeping hits with e-value < 1e-8

Frequency        Subject_description
1689 (16.9 %)    Picea sitchensis (Sitka spruce)
 907 (9.1 %)     Vitis vinifera (common grape vine)
 311 (3.1 %)     Physcomitrella patens subsp. patens (moss)
 282 (2.8 %)     Arabidopsis thaliana (thale cress)
 218 (2.2 %)     Oryza sativa Japonica Group (rice)
 153 (1.5 %)     Zea mays (maize)
  58 (0.6 %)     Oryza sativa Indica (rice)
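A frequency table like this can be produced with a few lines of Python. The sketch below assumes a hypothetical tab-separated input with three columns (read name, subject description, e-value); it is not tied to any particular BLAST output format, and the function name and the `max_evalue` parameter are illustrative only.

```python
from collections import Counter


def tally_hits(lines, max_evalue=1e-8):
    """Count subject descriptions among hits whose e-value is below `max_evalue`.

    Each line is assumed to hold: read_id <TAB> subject_description <TAB> evalue.
    Only the first sufficiently strong hit seen for each read is counted.
    """
    counts = Counter()
    seen_reads = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        read_id, description, evalue = line.split("\t")
        if read_id in seen_reads or float(evalue) >= max_evalue:
            continue
        seen_reads.add(read_id)
        counts[description] += 1
    return counts


example = [
    "read1\tPicea sitchensis\t1e-30",
    "read2\tVitis vinifera\t2e-12",
    "read3\tPicea sitchensis\t5e-9",
    "read4\tZea mays\t0.5",  # weak hit, filtered out by the e-value cut-off
]
hit_counts = tally_hits(example)
for description, count in hit_counts.most_common():
    print(count, description)
```

Unexpected organisms near the top of such a list (e.g. bacterial or human hits in a plant sample) would suggest contamination.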

Homopolymer error

•  Difference between a signal of 1 and a signal of 2 is 100%

•  Difference between a signal of 5 and a signal of 6 is only 20%, so errors are more likely after e.g. AAAAA

A ?c TT - AAAAA ?a
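The percentages follow from simple arithmetic: if the flow signal is proportional to the number of identical bases incorporated, then going from n to n+1 bases raises the signal by 1/n. A minimal sketch, assuming an idealised, exactly linear signal (real flowgrams are noisier), with a hypothetical function name:

```python
def relative_signal_step(run_length: int) -> float:
    """Percentage increase in signal when a homopolymer grows from n to n+1 bases,
    assuming the signal is exactly proportional to the number of bases."""
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    return 100.0 / run_length


for n in (1, 2, 5, 8):
    print(f"{n} -> {n + 1} bases: signal increases by {relative_signal_step(n):.1f}%")
```

The longer the run, the smaller the relative step between neighbouring lengths, which is why the base caller finds long homopolymers harder to count.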

Roche software

•  Roche have developed Data-Analysis software for processing, assembling and mapping the 454 reads:

•  sffinfo - to extract fasta, quality and flowgrams as text from sff files

•  sfffile - to join, split or trim sff files

•  gsAssembler (Newbler) - to assemble reads into contigs/isotigs

•  gsMapper - to map reads to a transcriptome or genome reference

•  gsAmplicon - to analyse variants in amplicons

•  (These run on 32-bit and 64-bit Linux. There is information on the wiki about obtaining and installing them.)

Exercise 1A – sff files

Aims:

•  Using ‘sffinfo’ and ‘sfffile’

•  Summarise the read statistics

•  Blast the reads for contaminants

The exercises are on the wiki:

http://tinyurl.com/taw2010wiki

What is “Newbler”?

•  Roche's “GS De Novo Assembler” (where “GS” = “Genome Sequencer”)

•  Designed to assemble reads from the Roche 454 sequencer

•  Accepts:

•  454 FLX Standard reads, and

•  454 Titanium reads

•  single and paired-end reads

•  Optionally can include Sanger reads

•  Initial versions focused on assembling genomic reads

•  Latest versions (2.3 and now 2.5.3) improve transcriptome assembly

•  Runs on Linux, and has 32-bit and 64-bit versions

•  Has command-line and Java-based GUI interfaces

•  Rarely called “Newbler” (for “New Assembler”) in Roche's documentation, rather “runAssembly” or “gsAssembler”

How does Newbler work?

cDNA → Reads → Alignments → Contig graph → Final untangled assembly

Inputs to Newbler assembler

Newbler accepts:

•  Roche's sff files (standard flowgram format)

•  Fasta files, with or without quality files, such as Sanger reads (which can be used as scaffolds)

•  Parameters specified by the user to guide the assembly (or all parameters can be left at their default values)

Command-line interface

•  The simplest command to run Newbler is:

runAssembly [options] reads.sff

•  This creates the assembly in an output directory whose name starts with P_ (for Project), followed by the date and time

•  A large number of optional parameters are available for controlling and refining the assembly

Common command-line options

•  -cdna → for transcriptome (cDNA) assembly

•  -urt → ‘use read tips’, to produce longer isotigs

•  -o output_directory → to set the name of the output directory

•  -vt trimmingFile.fasta → to trim primers and adapters from the start or end of reads

•  -vs screeningFile.fasta → to remove reads that closely match a screening sequence, such as a cloning vector, E. coli or rRNA

•  (-vs and -vt also match reverse-complements of the given sequences.)
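When many assemblies are run with different settings, it can help to build the command from a script. The Python sketch below only assembles and prints an argument list using the options listed above; it does not run anything. The function name and all file and directory names are hypothetical.

```python
import shlex


def build_run_assembly_command(sff_files, output_dir=None, cdna=False,
                               use_read_tips=False, trim_fasta=None,
                               screen_fasta=None):
    """Build the runAssembly argument list: options first, then the read files."""
    command = ["runAssembly"]
    if output_dir:
        command += ["-o", output_dir]
    if cdna:
        command.append("-cdna")
    if use_read_tips:
        command.append("-urt")
    if trim_fasta:
        command += ["-vt", trim_fasta]
    if screen_fasta:
        command += ["-vs", screen_fasta]
    command += list(sff_files)
    return command


command = build_run_assembly_command(
    ["reads.sff"],
    output_dir="my_assembly",       # hypothetical directory name
    cdna=True,
    trim_fasta="adapters.fasta",    # hypothetical adapter file
    screen_fasta="rRNA.fasta",      # hypothetical screening file
)
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.run(command, check=True)` on a machine where the Roche software is installed.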

Isogroups, Isotigs, Contigs?

•  Some definitions to understand Newbler output:

•  An isogroup:
- tries to represent a gene
- is a collection of isotigs containing reads that imply connections between the isotigs

•  An isotig:
- represents an individual transcript
- different isotigs from a given isogroup can be inferred to be splice variants

•  Contigs:
- contigs forming an isotig may be thought of as exons
- this is not strictly correct, as untranslated regions (UTRs) and introns (in the case of primary transcripts) may exist in the reads generated from the sample

Isotigs - more details

•  Connections between contigs in an isogroup are represented by sequences (reads) whose alignments diverge consistently towards two or more different contigs, or by a depth spike

•  The assembler trims and ignores any poly-A tails, so the true orientation of reads in the assembly cannot be determined. An isotig may therefore be output as the reverse-complement of the true biological transcript

•  For more details see pages 165 - 169 of the Roche software manual (which is on your computer’s Desktop in the ‘manual’ folder)

Output files for Transcriptome projects (1)

In the Assembly subdirectory:

•  454Isotigs.fna → fasta file of all isotigs, and of contigs which are not in an isotig

•  454Isotigs.qual → quality scores (Phred-based) for each base in the ‘454Isotigs.fna’ file (e.g. 20 = 1 in 100 probability of an incorrect base call; 50 = 1 in 100,000)

•  454Contigs.fna → fasta file of all contigs, which are used to create the isotigs

•  454Contigs.qual → quality scores for each base in the ‘454Contigs.fna’ file

•  454NewblerMetrics.txt → statistics of the assembly, e.g. number of reads and bases aligned, overlaps found, mean contig sizes

•  454ReadStatus.txt → status of each read in the assembly (Assembled, PartiallyAssembled, Singleton, TooShort, Outlier), and alignment 3' and 5' positions within the contig

•  454TrimStatus.txt → each read's original and revised trim-points used in the assembly
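The Phred-based scores in the .qual files map to error probabilities through the standard relation Q = -10 log10(p). A short Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_probability(quality: float) -> float:
    """Probability that a base call with the given Phred score is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability: float) -> float:
    """Phred score corresponding to a given error probability."""
    return -10 * math.log10(probability)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} base calls")
```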

Output files (2)

•  454AlignmentInfo.tsv → base consensus and quality, read-depth and flow-signal, at each position in each contig

•  Can easily be parsed by a Perl script to obtain e.g. the average coverage depth for each contig and isotig

•  The columns are:

Position    Consensus    Quality Score    Unique Depth    Align Depth    Signal    Signal StdDev
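The same kind of parsing can be sketched in Python. The layout assumed here is an assumption, not taken from the Roche manual: each contig starts with a line beginning with ">" followed by the contig name, and the following tab-separated rows hold Position, Consensus, Quality Score, Unique Depth, Align Depth, Signal and Signal StdDev in that order. Check the column order against your own file before relying on it.

```python
from collections import defaultdict


def mean_align_depth(lines):
    """Return {contig_name: mean Align Depth} from AlignmentInfo-style lines."""
    totals = defaultdict(int)
    positions = defaultdict(int)
    contig = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0] or fields[0] == "Position":
            continue  # blank line or column-header row
        if fields[0].startswith(">"):
            name = fields[0][1:].strip()
            contig = name.split()[0] if name else None
            continue
        if contig is None:
            continue  # rows before the first contig header are ignored
        totals[contig] += int(fields[4])  # fifth column: Align Depth (assumed)
        positions[contig] += 1
    return {name: totals[name] / positions[name] for name in totals}


example = [
    "Position\tConsensus\tQuality Score\tUnique Depth\tAlign Depth\tSignal\tSignal StdDev",
    ">contig00001\t1",
    "1\tA\t64\t3\t4\t1.02\t0.05",
    "2\tC\t64\t3\t6\t0.98\t0.04",
    ">contig00002\t1",
    "1\tG\t40\t1\t2\t1.00\t0.00",
]
depths = mean_align_depth(example)
print(depths)
```

For a real assembly, pass an open file handle instead of the example list.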

Output files (3)

•  454Contigs.ace = ACE format file, showing how reads were aligned to form contigs, viewable in e.g. Tablet or Consed

•  Unlike traditional ace files, in Newbler’s ace files:

•  the same read can be in several contigs (but is given an extra suffix), e.g. if one contig is in a repeat (higher coverage) region, the next contig is in a non-repeat (lower coverage) region, and the read spans the junction

•  a contig (and hence a read) can be shared between several isotigs

•  But a read should only be in one isogroup

Output files (4)

Only with the -cdna option:

•  454IsotigLayout.txt → how contigs are laid out along each isotig in the isogroup (454RefLink also gives which isotigs are in each isogroup)

•  e.g.:

>isogroup00003 numIsotigs=8 numContigs=11

Length : 495 508 142 171 251 308 98 61 61 566 306 (bp)
Contig : 02209 02600 02782 00425 02597 00426 02119 02340 02624 02132 02630 Total:
isotig00004 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1484
isotig00005 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1484
isotig00006 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1497
isotig00007 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1497
isotig00008 >>>>> >>>>> >>>>> >>>>> >>>>> 1472
isotig00009 >>>>> >>>>> >>>>> >>>>> >>>>> 1485
etc.
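The isogroup header lines in this file are easy to pick apart in a script, for example to list how many isotigs and contigs each isogroup contains. A minimal Python sketch, assuming headers look exactly like the ">isogroup00003 numIsotigs=8 numContigs=11" line in the example (the function name is hypothetical):

```python
import re

# Matches e.g. ">isogroup00003 numIsotigs=8 numContigs=11"
HEADER_PATTERN = re.compile(r"^>(\S+)\s+numIsotigs=(\d+)\s+numContigs=(\d+)")


def parse_isogroup_header(line):
    """Return the isogroup name and its isotig/contig counts, or None for other lines."""
    match = HEADER_PATTERN.match(line.strip())
    if match is None:
        return None
    name, num_isotigs, num_contigs = match.groups()
    return {
        "isogroup": name,
        "numIsotigs": int(num_isotigs),
        "numContigs": int(num_contigs),
    }


header = parse_isogroup_header(">isogroup00003 numIsotigs=8 numContigs=11")
print(header)
```

Lines that are not isogroup headers (lengths, contig names, isotig rows) return None and can simply be skipped.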

Exercise 1B - Assembly

Aims:

•  Assemble the dataset with Newbler

•  Summarise the resulting isotigs

•  Look into the assembly output files

•  View the assembly in Tablet viewer (http://bioinf.scri.ac.uk/tablet)

Obtaining Roche software

To obtain the latest Roche off-line Data_Analysis software, version 2.5.2 (which includes the sff tools, Newbler assembler, gsMapper and gsAmplicon), complete the software request form on the 454.com website:

http://454.com/contact-us/software-request.asp

How does Newbler work?

•  Identify pairwise overlaps between reads

•  Construct multiple alignments of overlapping reads

•  Break the multiple alignments where consistent differences are found between different sets of reads

•  This gives “contigs” that represent the assembled reads

•  Resolve branching structures between contigs, to generate isotigs

•  Generate consensus basecalls for the contigs, using quality and flow signal information at each base in the multiple alignments

•  Output the contig consensus sequences, quality scores, alignment and metric files

•  You will see messages about these steps as the assembly progresses

If paired-end data is available, the assembler performs these extra steps:

•  Organize contigs into scaffolds, using paired-end information to order the contigs and to approximate the distance between contigs
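The consensus step can be illustrated with a deliberately simplified Python sketch. Newbler uses quality and flow signal information; the version below swaps that for a plain majority vote per alignment column, and assumes the reads have already been aligned and padded with '-' to the same length. The function name is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the most common base in each column of equal-length, gap-padded reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        bases = [base for base in column if base != "-"]
        if bases:  # columns covered by no read are skipped
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


reads = [
    "ACGTTGCA----",
    "--GTTGCAGG--",
    "----TGGAGGTC",  # carries one disagreeing base (G instead of C) at column 7
]
consensus = majority_consensus(reads)
print(consensus)
```

The single disagreeing base in the third read is outvoted by the other two reads, which is the basic reason deeper coverage gives a more reliable consensus.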

GUI interface to Newbler

•  gsAssembler = Roche’s graphical interface to the Newbler assembler. It is based on Java

•  Type: gsAssembler & (the ‘&’ runs the assembler GUI in the background, so you can still use the command console)

•  Set the project name, directory, and Genomic or cDNA option

•  On the Project tab, select the directory containing the sff files, then uncheck any unwanted sff files

•  Set parameters for the project, such as MINT adapters to trim, a ribosomal RNA (rRNA) fasta file to screen out, and other assembler and output options

•  Click the “Start” button at the right, and watch the output at the bottom

•  When the assembly has finished, you can view it using the Results, Alignment and Flowgrams tabs

Experiment 208: Using the GUI

The graphical interface should appear

•  Choose options and run the assembly

•  Look at the resulting assembly in the viewing tab

•  What do you think about the accuracy of the assembly?

Roche's software also includes:

•  gsMapper → to map reads to a transcriptome or genome reference (for model organisms, a file of known annotations and SNPs can be specified)

•  gsAmplicon → to analyse variants (e.g. rare alleles) in ultra-deep coverage of regions of interest (see manual Part D on the website for more information)

•  File Tools:

•  sffinfo → extract fasta, quality and flowgrams as text from sff files

•  sfffile → join sff files; extract part of an sff file by MIDs, read names or random reads; or trim reads in user-defined ways

•  sff2scf → converts one read from an sff file into an SCF file (or performs “call throughs” to access SCF data for Sanger reads)

•  fnafile → constructs a FASTA file (and quality file) from a list of FASTA, PHD and SCF files

Viewing Assemblies

•  In addition to the alignment viewer in gsAssembler, there are several other viewers for viewing the ace alignment files:

•  Hawkeye: http://sourceforge.net/apps/mediawiki/amos/index.php?title=Hawkeye

•  Tablet: http://bioinf.scri.ac.uk/tablet/

•  bioinformatics.bc.edu/marthlab/

•  MagicViewer: http://bioinformatics.zj.cn/magicviewer/

Videos about 454 sequencing

•  Pyrosequencing:

http://www.youtube.com/watch?v=kYAGFrbGl6E

•  Genome Sequencer FLX System Workflow:

http://www.youtube.com/watch?v=bFNjxKHP8Jc

Exercise 1: Look into an sff file

•  ‘sffinfo’ is a command-line program that is part of the Roche Data_Analysis package

•  To view the binary sff file as text, run:

cd ~/data/Axolotl

sffinfo Axolotl_reads.sff | less

(Piping to less allows you to scroll easily.)

Type ‘q’ to quit less

Exercise 2: Extract reads from an sff file

•  Use the file: Axolotl_reads.sff

•  Extract reads from the sff file into a fasta file:

sffinfo -seq Axolotl_reads.sff > Axolotl.fna

head Axolotl.fna

•  Extract the quality information from the sff file:

sffinfo -qual Axolotl_reads.sff > Axolotl.qual

head Axolotl.qual

•  Count the number of reads (The quotes are important):

grep -c ">" Axolotl.fna
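Beyond counting reads, the read statistics can be summarised from the fasta file with a short script. The Python sketch below is one possible way to do it using only the standard library; the function names are hypothetical, and the example uses two made-up reads rather than the Axolotl data.

```python
def read_lengths(fasta_lines):
    """Return a list with the length of every sequence in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the read being accumulated, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)  # sequences may be wrapped over several lines
    if current is not None:
        lengths.append(current)
    return lengths


def summarise(lengths):
    """Return read count, total bases, and mean, shortest and longest read length."""
    if not lengths:
        return {"reads": 0, "total_bases": 0, "mean_length": 0.0, "min": 0, "max": 0}
    return {
        "reads": len(lengths),
        "total_bases": sum(lengths),
        "mean_length": sum(lengths) / len(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


example = [">read1", "ACGTACGT", ">read2", "ACGTA", "CGTAC"]
stats = summarise(read_lengths(example))
print(stats)

# To summarise the file produced in this exercise instead:
# with open("Axolotl.fna") as handle:
#     print(summarise(read_lengths(handle)))
```

The "reads" value should agree with the count reported by the grep command.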
