NGUYEN TAT THANH UNIVERSITY
True learning, true practice, true success, true future
FACULTY OF BIOTECHNOLOGY
GRADUATION THESIS
APPLICATION OF BIOINFORMATICS IN ASSEMBLING, ANNOTATING AND DRAWING COMPLETE CHLOROPLAST GENOME OF Paphiopedilum delenatii (Guillaumin 1924)
Ho Chi Minh City, 2019
NGUYEN TAT THANH UNIVERSITY
FACULTY OF BIOTECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence - Liberty - Happiness - oOo -
ASSIGNED TASK OF GRADUATION THESIS
Full name: Nguyen Thanh Diem Student ID: 1511540870
1 Thesis title:
Application of bioinformatics in assembling, annotating and drawing complete chloroplast genome of Paphiopedilum delenatii (Guillaumin 1924) in Vietnam
2 Objectives
Application of bioinformatics in assembling, annotating and drawing the
complete chloroplast genome of the endemic species Paphiopedilum delenatii
in Vietnam
3 Contents
- Extraction and quality assessment of DNA
- Raw data quality assessment
- Filtering out low quality sequences
- Genome assembly
- Genome annotation from contigs
- Construct genome map from annotated genome
4 Execution time: from February/2018 to August/2019
5 Supervisor: MSc Vu Thi Huyen Trang
Contents and requirements of this thesis were approved by the department
HCM city, ,2019
ACKNOWLEDGEMENTS
To complete this graduation thesis, I would like to thank my parents for always supporting and encouraging me during my studies. Sincere thanks to MSc Vu Thi Huyen Trang, Faculty of Biotechnology, Nguyen Tat Thanh University, for her wholehearted guidance and for creating all favorable conditions for me during the thesis work.
Sincere thanks to:
- The board of directors of Nguyen Tat Thanh University, who created all favorable conditions for me during my studies at the school.
- All teachers of the Faculty of Biotechnology, Nguyen Tat Thanh University, who taught me during the course.
- My whole 15DSH1B class, who helped me during my studies.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
INTRODUCTION
CHAPTER 1 LITERATURE REVIEW
1.1 Introduction of Paphiopedilum delenatii
1.1.1 Position of Paphiopedilum delenatii in the classification system
1.1.2 Morphological characteristics
1.1.3 Ecological characteristics
1.1.4 Individual status of Paphiopedilum in nature
1.2 Introduction of chloroplast genome
1.2.1 Introduction of chloroplast
1.2.2 Introduction of chloroplast genome
1.3 Introduction of bioinformatics and genome assembly
1.3.1 Application of Bioinformatics in genome assembly
1.3.2 Overview of genome annotation
1.3.3 Overview of genome map
1.4.3 Chlorobox website
1.5 Related scientific research projects
1.5.1 Domestic projects
1.5.2 Foreign projects
CHAPTER 2 CONTENTS AND METHODS
2.1 Place of administration
2.2 Contents
2.3 Methods
2.3.1 Materials
2.3.2 DNA extraction
2.3.3 Quality assessment of DNA
2.3.4 Quality assessment of raw data
2.3.5 Genome assembly
2.3.6 Genome annotation
2.3.7 Genome map construction
CHAPTER 3 RESULTS
3.1 Results of DNA extraction
3.2 Raw data assessment
3.3 The results of genome assembly
3.4 The accuracy of genome structure
3.5 Genome annotation
3.6 The genome map construction
CHAPTER 4 DISCUSSION
4.1 Raw data assessment
4.2 Phred quality score as input data for assembly
4.3 The length of K-mer substring
4.4 The reference sequence (refseq) and seed sequence (seed)
4.5 Genome annotation
4.6 Genome map
CONCLUSIONS AND RECOMMENDATIONS
REFERENCES
ABSTRACT
Chloroplasts and mitochondria are organelles that have their own genomes. The chloroplast genome provides information for studying evolutionary relationships, identifying species, and developing transgenic and cloned plants. Next Generation Sequencing makes chloroplast genome sequencing easier. However, assembling a chloroplast genome remains complicated because it requires several complex bioinformatics tools, a high-configuration computer and considerable time. Here we configured a process for assembling the chloroplast genome of Paphiopedilum delenatii. The assembled chloroplast genome was 160,955 bp in length and comprised a large and a small single copy region (LSC, SSC) separated by a pair of inverted repeats (IR). A total of 130 genes were annotated, and the GC content was 35.6%. The genome was mapped and registered in GenBank under accession MK463585. Optimal parameters for genome assembly are recommended. This study not only provides information for the conservation of the Vietnamese endemic species Paphiopedilum delenatii but also supports genome assembly research that could be applied to other subjects.
LIST OF FIGURES
Figure 1.1 Morphology of Paphiopedilum delenatii
Figure 1.2 Genome assembly steps
Figure 1.3 De Bruijn graph
Figure 1.4 The assembled chloroplast genome of P. armeniacum
Figure 1.5 Genome map of Citrus aurantiifolia
Figure 3.1 The result of electrophoresis of two P. delenatii samples on 0.8 %
Figure 3.2 Quality assessment of raw data using FastQC
Figure 3.3 The BLAST result of two DNA sequences with the reference genome of P. armeniacum (KT388109.1)
Figure 3.4 The complete chloroplast genome map of Paphiopedilum delenatii
LIST OF TABLES
Table 2.1 The length of reference sequences on NCBI
Table 2.2 The survey parameters set for the genome assembly process
Table 3.1 The results of OD and concentration of two P. delenatii samples with Nanodrop and Quantus
Table 3.2 The results of genome assembly
Table 3.3 List of annotated genes in the chloroplast genome of P. delenatii
Table 4.1 Evaluation of the correlation between quality score and reliability ratio
Table 4.2 Comparing P. delenatii and P. armeniacum genomes
LIST OF ACRONYMS
NDH : NADH dehydrogenase - like complex
NCBI : National Center for Biotechnology Information
IUCN : International Union for Conservation of Nature
INTRODUCTION
Locating genes and determining their roles in the genome is an important contribution to species and genetic conservation. One method currently in use is DNA barcoding, which identifies species quickly and accurately and helps distinguish rare species from common ones. DNA barcoding uses short sequences for identification. Recently, some researchers have suggested using the complete genome as a meta-barcode to increase detection efficiency. Among the genomes in the cell, the chloroplast genome is small and highly diverse, and has therefore been sequenced in many studies. Genetic understanding of this genome can help to track evolution, determine purity, identify species and improve cloning processes.
Previously, it was difficult to sequence a complete genome. The introduction of Next Generation Sequencing (NGS) technology was a revolution in sequencing: it can generate huge amounts of data at low cost, so more chloroplast genomes have been sequenced for functional genomic studies. Together with NGS, bioinformatic tools can extract chloroplast sequences from total DNA fragments and assemble the complete chloroplast genome without a prior chloroplast isolation step.
Vietnam is a country with diverse flora. In particular, Vietnam has many rare orchids, including slipper orchids. Their flowers are shaped like a slipper and have a characteristic aroma. Slipper orchids often grow in hard-to-reach places, which increases the value of the plants. Paphiopedilum delenatii, an endemic orchid species of Vietnam, is listed as Critically Endangered (CR) by the IUCN, and conservation measures are therefore urgently needed. Genome information for this species is still limited.
For the above reasons, we carried out the thesis “Application of bioinformatics in assembling, annotating and drawing complete chloroplast genome of Paphiopedilum delenatii (Guillaumin 1924) in Vietnam”.
2 Objectives
Sequencing, assembling, annotating and drawing the complete chloroplast genome
of the endemic species Paphiopedilum delenatii in Vietnam
CHAPTER 1 LITERATURE REVIEW
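1.1 Introduction of Paphiopedilum delenatii
1.1.1 Position of Paphiopedilum delenatii in the classification system
Species: Paphiopedilum delenatii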
1.1.2 Morphological characteristics
Paphiopedilum delenatii is a terrestrial perennial herb with 5-7 leaves arranged in 2 rows. The leaves are oval to oblong and have 3 small teeth at the tip. They grow near the root and have hairs along the edges. The upper surface of the leaves has a mosaic of light and dark green spots, and the underside has dense purple spots.
The inflorescence usually bears 1-2 flowers, rarely up to 3. The flower stalk is 22 cm long, green with purple spots, and covered with stiff hairs. The flowers are pink, 7.5-8 cm in diameter, with a purplish pink lip that has red spots and is covered with hairs. The lip is oval to globular and bears small hairs.
Figure 1.1 Morphology of Paphiopedilum delenatii. a - flowering plant; b - flower; c - back sepals; d - sepals; e - petals; f, g - lip, viewed from the front, side and vertical section; h, i, j - stamens, viewed from the side and behind
1.1.3 Ecological characteristics
P. delenatii, an endemic species of Vietnam, is distributed mainly in the provinces of Dak Lak and Khanh Hoa, on the eastern side of the Hon Giao and Bi Dup mountains. Plants are usually found at elevations of 750-1300 m and grow in primeval forests on granite and gneiss substrates. P. delenatii is adapted to moist, well-drained soil on rocky slopes rich in humus and on limestone cliffs. It is a shade-dwelling plant.
The fruits are long and dry, and ripen 6-10 months after pollination. Seeds are very light and easily spread by the wind. Orchid seeds have no endosperm, so their nutrient supply depends on symbiotic relationships with root fungi. In mature plants, the leaf- and flower-bearing buds that developed during the previous growing season bloom in the following year. P. delenatii has a life cycle of 7-8 years and flowers in December (https://www.iucnredlist.org/).
1.1.4 Individual status of Paphiopedilum in nature
At flowering time, the flowers of P. delenatii are easily damaged by rain, so seed set is difficult. In addition, the plants are very sensitive to environmental factors such as altitude and nutrient sources. Their habitat is shrinking because of climate change, deforestation, soil erosion and exploitation. Currently, P. delenatii is listed as CR and its population is decreasing. The number of mature individuals is estimated at 200-250 (https://www.iucnredlist.org/).
1.2 Introduction of chloroplast genome
1.2.1 Introduction of chloroplast
The chloroplast (cp) is an important organelle in plant and algal cells. It carries out photosynthesis, converting light energy into carbohydrates and releasing oxygen. Metabolites synthesized in the chloroplast are important for plant interactions with the environment (responses to temperature, light, etc.) and for resistance to pathogens [2].
1.2.2 Introduction of chloroplast genome
The chloroplast genome consists of a circular DNA molecule containing two Inverted Repeat regions (IR), a Large Single Copy region (LSC) and a Small Single Copy region (SSC). The chloroplast genome is 115-165 kbp long and is present in 1000-10000 copies [4]. The cp genome of rice is 134,503 bp long, comprising the LSC (80,548 bp), the SSC (12,347 bp) and two IRs (20,804 bp each), with 120 genes [5]. Because the cp genome is highly conserved, small, little affected by recombination and maternally inherited, it has been sequenced in many species. These data can be used to build phylogenetic trees and applied in DNA barcoding for molecular identification [6].
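The rice figures quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming the quoted IR length applies to each of the two IR copies; the function name is hypothetical and chosen only for illustration.

```python
def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a circular cp genome: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


# Region lengths of the rice cp genome as quoted in the text (bp).
rice_total = quadripartite_length(lsc=80_548, ssc=12_347, ir=20_804)
print(rice_total)  # 134503
```

The sum matches the reported genome length of 134,503 bp, which is consistent with the IR value being the length of a single copy.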
The first cp genome was sequenced in 1986, from tobacco (Nicotiana tabacum). Today, 2357 cp genomes are available in the database of the National Center for Biotechnology Information. Information from cp genomes has contributed greatly to the study of plant diversity, supporting phylogenetic studies and resolving evolutionary relationships among plants. In addition, cp genome sequences reveal differences in both nucleotide sequence and genome structure. These data are valuable for studying the climate adaptation of crops and for identifying and conserving valuable traits. Understanding the cp genome has also shed light on the relationship between the chloroplast, mitochondrial and nuclear genomes in plants [7].
1.3 Introduction of bioinformatics and genome assembly
1.3.1 Application of Bioinformatics in genome assembly
1.3.1.1 Overview of genome assembly
In the DNA sequencing process, DNA is fragmented into many short segments. A stage is therefore needed to assemble these segments into a complete DNA sequence. Genome assembly is based on the overlaps between reads: overlapping reads are joined into contigs, from which a complete DNA sequence is constructed.
However, genome assembly is a complicated process because it requires several complex bioinformatics tools, a high-configuration computer and considerable time. The ideal result of genome assembly is a single contig; in reality, this can be difficult to obtain, and several short contigs are acquired instead.
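The idea of joining reads through their overlaps can be illustrated with a small sketch. It is not the algorithm of any particular assembler: the function names, the minimum overlap of 3 bases and the two example reads are all assumptions made for illustration, and real assemblers must also handle sequencing errors, reverse complements and repeats.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least min_len bases


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two reads into one contig if they overlap, otherwise return None."""
    size = overlap_length(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical reads sharing the 4-base overlap "GTAC".
contig = merge_reads("ATGGCGTAC", "GTACCTGA")
print(contig)  # ATGGCGTACCTGA
```

When no pair of reads overlaps sufficiently, the reads stay separate, which is one reason an assembly can end in several contigs rather than one.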
1.3.1.2 De Bruijn graph
A De Bruijn graph represents the overlaps between sequences of the same length. These sequences are substrings of length k, called k-mers [8]. When the k-mer value is changed, the sample sequence is cut into short nucleotide sequences whose length depends on the new k-mer value. The De Bruijn graph is considered a simple, time-saving and effective method in genome assembly.
Figure 1.3 De Bruijn graph (panels: the output reads; how the reads align; create all the possible 4-mers, e.g. ACTC, CTCG; discard like 4-mers and align the rest)
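A De Bruijn graph can be sketched in a few lines. The version below uses the common formulation in which nodes are (k-1)-mers and every distinct k-mer contributes one edge from its prefix to its suffix; this is an assumption for illustration and not necessarily the exact construction used by a given assembler. The two reads are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """All substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Identical k-mers from different reads collapse into one edge.
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(targets) for node, targets in graph.items()}


# Hypothetical reads; with k = 4 they share the 4-mer CTCG.
graph = de_bruijn_graph(["ACTCG", "CTCGA"], k=4)
print(graph)  # {'ACT': ['CTC'], 'CTC': ['TCG'], 'TCG': ['CGA']}
```

Following the edges ACT, CTC, TCG, CGA spells ACTCGA, the sequence both reads came from. A larger k gives longer, more specific k-mers but requires longer exact overlaps between reads.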
1.3.2 Overview of genome annotation
Genome annotation is the process of identifying the name, location, structure and functions of the genes within a genome. At first, the term was used only for specifying functional regions in the genome (mRNA). However, genome annotation was later expanded to cover other regions as well [9]. Genome annotation can be divided into three basic categories: nucleotide level, protein level, and process level [10].
Genome annotation is an extremely important stage in genome sequencing. After the genome is completely assembled, the resulting data are raw sequence that carries no biological meaning on its own. Annotating a genome is essential for analysing and explaining genetic factors, which in turn provides information on biological processes. The diversity of functional elements in the human genome [11] and the biochemical functions of gene segments [12] have been demonstrated through genome annotation. This step not only benefits genome sequencing research but also plays an important role in genomics.
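At the nucleotide level, an annotation can be thought of as a record of name, location, strand and function attached to a stretch of sequence. The sketch below shows one possible representation; the class, the gene names and the toy genome are hypothetical and are not taken from any annotation tool.

```python
from dataclasses import dataclass

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class GeneFeature:
    name: str     # gene name
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"
    product: str  # short functional description

    def sequence(self, genome: str) -> str:
        """Return the feature sequence, reverse-complemented on the minus strand."""
        segment = genome[self.start - 1:self.end]
        if self.strand == "-":
            return segment.translate(COMPLEMENT)[::-1]
        return segment


genome = "AAATGCCCTAGGG"  # toy sequence, for illustration only
gene_a = GeneFeature("geneA", 3, 8, "+", "hypothetical protein")
gene_b = GeneFeature("geneB", 8, 13, "-", "hypothetical protein")
print(gene_a.sequence(genome))  # ATGCCC
print(gene_b.sequence(genome))  # CCCTAG
```

Storing the strand matters because a gene on the minus strand is read from the reverse complement of the assembled sequence.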
1.3.3 Overview of genome map
A genome map is used to identify and record the location, direction, and distances between genetic factors. These factors are termed markers and can consist of genes and non-coding regions. A genome map can be considered a fundamental step in locating harmful genes, which has a large effect on genetic engineering techniques such as cleaving, ligation, and gene silencing. Furthermore, comparing genome maps between species can reveal conserved regions and provide information on the location of unidentified genes based on other known species. High-precision genome maps play an important role in genome assembly, ensuring continuity and avoiding gaps in the assembly [13].
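For a circular genome such as that of the chloroplast, the distance between neighbouring markers has to wrap around the origin. The sketch below computes these gaps under simple assumptions: markers do not overlap, coordinates are 1-based and inclusive, and the marker names, coordinates and genome length are hypothetical.

```python
def circular_gaps(markers, genome_length):
    """Gaps (bp) between consecutive markers on a circular map.

    markers: list of (name, start, end) tuples, 1-based inclusive.
    Returns a list of (left_name, right_name, gap_bp) in map order.
    """
    ordered = sorted(markers, key=lambda marker: marker[1])
    gaps = []
    for i, (name, _start, end) in enumerate(ordered):
        next_name, next_start, _next_end = ordered[(i + 1) % len(ordered)]
        # The modulo handles the last marker, whose neighbour lies past the origin.
        gap = (next_start - end - 1) % genome_length
        gaps.append((name, next_name, gap))
    return gaps


markers = [("m1", 10, 20), ("m2", 31, 50), ("m3", 81, 95)]
print(circular_gaps(markers, genome_length=100))
# [('m1', 'm2', 10), ('m2', 'm3', 30), ('m3', 'm1', 14)]
```

The last gap runs from position 96 to 100 and then from 1 to 9, which is why it is 14 bp rather than a negative number.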
1.4.1 FastQC program
FastQC is a quality control analysis tool written by Simon Andrews of Babraham Bioinformatics (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). It is a very popular tool that provides an overview of basic quality control metrics for raw next generation sequencing data. After the raw data are analyzed, FastQC produces an HTML file that can be viewed in a browser. In addition to the graphical or list data provided by each module, a flag of “Passed”, “Warn” or “Fail” is assigned. However, users should be cautious when using these flags to assess sequence data. A “Warn” or “Fail” flag from a module does not necessarily mean that the sequencing run failed; it means that users should stop and consider what the result means in the context of that particular sample and the type of sequencing that was run.
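Modules that need a second look can be listed automatically rather than read off the report by eye. The sketch below assumes a tab-separated summary with one line per module in the order status, module name, file name, which is how FastQC's summary.txt is usually laid out; the file name and the statuses in the example are invented.

```python
def flagged_modules(summary_text):
    """Return (module, status) pairs for every module flagged WARN or FAIL."""
    flagged = []
    for line in summary_text.strip().splitlines():
        status, module, _filename = line.split("\t")
        if status in ("WARN", "FAIL"):
            flagged.append((module, status))
    return flagged


# Invented example content; sample_R1.fastq is a hypothetical file name.
example = (
    "PASS\tBasic Statistics\tsample_R1.fastq\n"
    "WARN\tPer sequence GC content\tsample_R1.fastq\n"
    "FAIL\tPer base sequence quality\tsample_R1.fastq\n"
)
for module, status in flagged_modules(example):
    print(f"{status}: {module} (review in the context of this sample)")
```

The list is only a starting point for inspection; as noted above, a flag by itself does not decide whether the data are usable.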
1.4.2 NOVOPlasty program
NOVOPlasty is free software specifically designed to assemble organelle genomes and is capable of providing a complete sequence within 30 minutes. The program is written in Perl and requires no additional software or support modules to run [14]. Illumina sequencing data are compatible with NOVOPlasty. Incomplete assemblies caused by low-coverage regions (low GC) can be resolved by increasing the coverage (up to 1000X or higher). However, increasing the coverage slows down the assembly and requires more memory. The program's algorithm allows it to assemble through problematic regions, such as sequences with a high AT percentage. After completing the assembly, NOVOPlasty exports the contigs, which are then automatically arranged in order to facilitate assembly of the complete genome.
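Coverage figures such as 1000X can be related to read counts with the usual estimate coverage = number of reads x read length / genome length. The sketch below is a rough illustration only: the 150 bp read length is an assumption, the function names are hypothetical, and the calculation counts only reads that actually originate from the chloroplast, which in total DNA data is a fraction of all reads.

```python
import math


def mean_coverage(read_count, read_length, genome_length):
    """Average depth if the reads were spread evenly over the genome."""
    return read_count * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_length / read_length)


# Genome length from the abstract (160,955 bp); 150 bp reads are assumed.
print(reads_needed(1000, 150, 160_955))  # 1073034
```

About 1.07 million chloroplast-derived reads of this length would therefore be needed for 1000X, which gives a feel for why higher coverage costs time and memory.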
1.5.1 Domestic projects
In 2015, Huynh Phuoc Hai and Nguyen Van Hoa at An Giang University developed an improved chloroplast genome assembly process for assembling the chloroplast genome from raw data without using a reference. The study evaluated four data sets: Arabidopsis thaliana, Oryza sativa Indica, Sorghum bicolor, and Lenconten. The procedure achieved 94.4 % to 98.8 % accuracy compared to the original genome, an improvement over traditional methods. It was considered metabarcoding because mapping to reference sequences can be bypassed [17].
The Vietnamese genome sequencing project of Dang Thanh Hai and colleagues, carried out in 2015, showed a close relationship between the Kinh people and the Japanese and Chinese. It was noted that, despite this close relationship with Chinese people, the Kinh genome was quite similar to that of Europeans and Africans [18].
1.5.2 Foreign projects
In 2015, Choun-Sea Lin and colleagues studied the position and relocation of ndh genes in the chloroplast genomes of some orchids. The project covered a number of orchid species, including Vanilla planifolia, Paphiopedilum armeniacum, Paphiopedilum niveum, Cypripedium formosanum, Habenaria longidenticulata, Goodyera fumata and Masdevallia picturata. After sequencing, the reads were assembled into chloroplast genomes, followed by construction of a phylogenetic tree and identification of the number of ndh genes. Based on the changes in location of the ndh genes, scientists will be able to identify mutations for the development of a wide variety of orchids [19].
Of all plants, most species with whole genome sequences are crops, especially rice. In 2016, Zhiqiang Wu and colleagues sequenced the chloroplast genome of wild rice (Oryza australiensis). Yeisoo Yu and colleagues sequenced the chloroplast genome of rice variety Nagina-22 in 2017. The results of these studies have enriched the genetic resources available to support rice breeding in later generations [20].
Currently, the chloroplast genomes of many orchids have been sequenced. Within the genus Paphiopedilum, only 4 species, including P. armeniacum, P. niveum, and P. dianthum, have a complete annotated chloroplast genome so far. P. delenatii is at risk of extinction, yet its chloroplast genome has not been sequenced. Chloroplast genome sequencing enables scientists to study and preserve the precious gene pool of the genus Paphiopedilum.