GENERAL INTRODUCTION
Scientific taxonomy and morphological description
Paphiopedilum Pfitz. 1886, a prominent genus of the slipper orchid subfamily Cypripedioideae, belongs to the family Orchidaceae (Averyanov et al., 2004; Chase et al., 2007). Characterized by a unique flower structure with a deep, pocket-shaped lip resembling a slipper, Paphiopedilum is commonly referred to as 'Lady's Slipper' or 'Venus's Slipper' orchids (Tran, 1998).
Figure 1.1 Flower morphology of Paphiopedilum dalatense species, one of the endemic species of Vietnam
Paphiopedilum species are herbaceous plants that grow on the forest floor, primarily as terrestrial plants, while some are lithophytes and a few are true epiphytes. These species typically have very short rhizomes and stems, with the exception of P. malipoense and P. micranthum, which possess interconnected multi-rhizomes forming underground networks.
The plant bears 3 to 7 distichous leaves that can be elliptic, oblong, or obovate in shape. The leaf apex is typically round or acute, often with an asymmetrical bilobate or trilobate form. The leaf base is conduplicate, forming a V-shape that encases the stem. The leaf blade is either uniformly green or tessellated with dark and pale green on the upper surface, a pattern characteristic of many Paphiopedilum orchids. These mosaic spots are thought to result from the uneven distribution of chlorophyll across the leaf surface, which produces alternating light and dark patches.
Certain species exhibit distinctive purple spots at the base or across the lower surface of their leaves. The characteristics and density of blue mosaic spots on both the upper and lower leaf surfaces are specific to each species, making detailed observation essential for accurate identification.
Worldwide, 80 Paphiopedilum species have been recorded, both from nature and through cross-breeding (Cribb, 1998; Averyanov et al., 2004).
Paphiopedilum population in Vietnam
The Paphiopedilum population of Vietnam falls into two primary ecological groups. One inhabits the limestone mountains of northern Vietnam, on sun-shaded slopes and in deep cracks at elevations of 400 to 1600 meters. The other is found on acidic silicate, granite, or rhyolite rock in Central and South Vietnam, particularly in provinces such as Lam Dong and Khanh Hoa. While most Paphiopedilum species prefer low light conditions, strap-leaf multifloras require medium light. Ideal temperatures for these orchids range from 18.3 °C to 23.9 °C during the day and from 12.8 °C to 15.6 °C at night. Notably, Vietnam has a high density and diversity of Paphiopedilum species, making it a significant region for orchid diversity.
Published references on Vietnamese Paphiopedilum species report differing data (Appendix Table A.3).
Tran Hop described 14 species in 1998 (Tran, 1998), of which P. bellatulum, collected in Da Lat, was never subsequently described as a Vietnamese plant. This species is primarily found in China, Myanmar, and Thailand (Zhongjian et al., 2009). As a result, it is considered an imported cultivar and is excluded from our study.
In 2004, Averyanov and a team of Vietnamese plant experts conducted a comprehensive field survey, resulting in the publication of "Slipper Orchids of Vietnam," which documented 18 original Paphiopedilum species, including 7 varieties (2 of P. hirsutissimum, 3 of P. malipoense, and 2 of P. villosum), along with 9 natural hybrid species. The book provides detailed descriptions covering morphology, distribution, ecology, and physiology, complemented by illustrative photographs.
Among the hybrids mentioned, only P. x dalatense and P. x herrmannii have comprehensive and reliable data on their distribution and hybrid status in natural populations. The remaining hybrids are referenced only through indirect descriptions, without sufficient evidence to confirm their presence in Vietnam.
P. x aspersum is believed to be the rarest slipper orchid, with a very restricted range, and is sometimes identified as a variety, Paphiopedilum barbigerum var. aspersum.
In 2008, Averyanov again published the illustrated survey of Vietnam
The updated Paphiopedilum article reflects significant changes since the 2004 version, maintaining the original 18 species while introducing numerous new varieties. Notable additions include three varieties of P. callosum, two of P. gratrixanum, two of P. hirsutissimum, and three of P. malipoense.
The list of hybrid species of P. villosum has changed significantly since the 2004 version, with some hybrids removed and new ones added (Appendix Table A.3). Notably, several of these hybrids were collected in Lam Dong (Da Lat). However, comprehensive data on their ecology and distribution remain limited and often rely on indirect descriptions.
P. x aspersum is data deficient and has seldom been found in Vietnam for years (Averyanov et al., 2004). In 2018, a new variety of this natural hybrid, Paphiopedilum x aspersum var. trantuananhii from Vietnam, was confirmed by Olaf Gruss, Leonid V. Averyanov, Chu Xuan Canh, and Nguyen Hoang Tuan. It had previously been classified as the natural hybrid Paphiopedilum trantuanii, identified in 2007 as a cross of P. barbigerum and P. henryanum, which is another name for Paphiopedilum x aspersum from 2002. This orchid is believed to be cross-pollinated by insects.
Natural hybridization among species in the genus is common, but confirming a natural hybrid species relies on the distribution and stability of its wild population. In Vietnam, reliable statistics on Paphiopedilum hybrids are lacking, with only P. x dalatense and P. x herrmannii being well known. Sadly, the endemic P. x herrmannii is thought to be extinct in nature (Averyanov et al., 2004; Averyanov, 2008).
Figure 1.2 Flowers of some Vietnam Paphiopedilum species
Paphiopedilum armeniacum, a beautiful orchid species, is primarily found in Yunnan province, China. Official scientific reports do not acknowledge its existence in Vietnam, but numerous collectors have reported its presence and distribution in limestone mountains in Vietnam at the northern border with China and Myanmar (Following Vietnam Paphiopedilum (Period 1), 2012). Therefore, this species was also included in our study.
In 2010, Paphiopedilum canhii was reported as a new species from northern Vietnam (Averyanov et al., 2010). Then, the new subgenus Megastaminodium Braem & Gruss 2012 was published to accommodate Paphiopedilum canhii (Braem and Gruss, 2012). However, this species was immediately classified as being in danger and has little chance of surviving in nature.
Vietnam is home to approximately 20 original species of Paphiopedilum, along with various natural hybrids; however, many of these taxa face the threat of extinction, with some no longer found in the wild (Averyanov et al., 2004). The IUCN Red List 2020-2022 indicates that 2 species are vulnerable, 7 are endangered, and 11 are critically endangered. The rapid increase in uncontrolled sampling is driven by local communities and orchid traders collecting plants shortly after their discovery. Effective conservation efforts are hindered by the complexity of maintaining natural populations. To combat illegal trade, it is essential to establish protected areas and to educate customs and quarantine inspectors on distinguishing rare species from common ones. The challenge is exacerbated by the fact that most traded specimens are in their vegetative stage, when high morphological similarity leads to misidentification. Therefore, alongside basic morphological assessments, there is a pressing need for more effective identification methods for Paphiopedilum species in Vietnam.
The selection of identification techniques
Common identification techniques
Morphology-based classification has traditionally been the most popular method for identifying organisms (Averyanov et al., 2004). However, accurate identification becomes difficult when samples are immature or partially damaged. To address these issues, scientists are increasingly turning to molecular classification techniques. PCR-based methods, which require only a small sample, offer a promising solution, although amplification techniques are most effective when applied to samples from a specific group.
Random Amplified Polymorphic DNA (RAPD), Restriction Fragment Length Polymorphism (RFLP), and Amplified Fragment Length Polymorphism (AFLP) are techniques used to analyze genetic material (Abbas et al., 2017). Additionally, species-specific PCR can be employed to detect specific genes or proteins (Sun et al., 2011; Peyachoknagul et al., 2014; Kim et al., 2015b). However, these PCR band-based methods are not effective for identifying unknown taxa.
The collection of genetic information is crucial for understanding the origins of diverse organisms worldwide and is vital for species protection, phylogenetic analysis, and the management of genetic diversity (Kress et al., 2005; Kim et al., 2014). Sequence-based methods in bioinformatics provide detailed insights by analyzing every nucleotide, allowing genetic relationships with unknown samples to be evaluated against existing sequence libraries. DNA polymorphism offers more information than proteins because of the degeneracy of the genetic code and the extensive non-coding regions (Pereira et al., 2008). Consequently, short DNA fragments serve as unique identifiers, akin to human fingerprints, in a technique known as DNA barcoding (Hebert et al., 2003). This rapid and sensitive method enables researchers to identify species efficiently from minimal samples, making it particularly effective for conservation efforts in biodiversity hotspots and for distinguishing medicinal species from adulterants (Mondini et al., 2009; Corrado, 2018; Lahaye et al., 2008; Vandergast et al., 2013; Lima et al., 2018; Asahina et al., 2010; Ghorbani et al., 2017b; Gao et al., 2019; Liu et al.).
Identification techniques using DNA sequences - DNA barcodes
The ‘DNA barcode’ concept was first introduced by Paul Hebert in 2003 (Hebert et al., 2003).
DNA barcoding is a powerful method for swiftly and accurately identifying taxonomic species across all forms of life, including bacteria, fungi, plants, animals, and humans (Lahaye et al., 2008). The technique uses a specific segment of DNA as a unique identifier for living taxa, allowing differentiation even from minimal biological samples at any developmental stage (Parveen et al., 2012). The effectiveness of DNA barcodes relies on selecting DNA regions that can serve as reliable markers for species identification.
The main criteria are (i) universality of amplification and sequencing, (ii) the pattern of intraspecific versus interspecific variation, and (iii) the power to identify species (Hollingsworth et al., 2009).
The selection of a barcode locus is challenging because the locus must be universally applicable across taxa and must not suffer from substitution saturation. The 5' end of cytochrome c oxidase 1 (CO1) from the mitochondrial genome has been shown to identify animals effectively. In contrast, finding a DNA barcode for plants is more complex, as mitochondrial genes like CO1 exhibit a low rate of synonymous substitution, along with significant genomic structural rearrangements and the import of sequences from the nucleus and chloroplast. Consequently, these genes are not recommended for plant DNA barcoding, prompting the search for universal sequences from the nuclear and chloroplast genomes. Numerous studies have reviewed the search for and application of DNA barcodes, summarizing the efficiency of various markers and discussing the effectiveness of different barcoding techniques, criteria, and measurements.
DNA barcoding involves extracting DNA from fresh samples and sequencing it. Each organism group has a unique DNA sequence, which serves as a barcode for accurate taxonomic identification.
The use of various sequence regions for plant identification has become prevalent, with the internal transcribed spacer (ITS) being the most commonly used (Xiang et al., 2011; Singh et al., 2012; Wu et al., 2012; Yukawa et al., 2013; Xu et al., 2015; Tran et al., 2018; Veldman et al., 2018). Notably, ITS2 exhibits greater variability than the full-length ITS, making it a viable alternative (Yu et al., 2017; Liu et al., 2019). The trnH-psbA locus has also been extensively studied, achieving up to 100% species resolution across 72 angiosperm genera (Kress et al., 2005) and within specific genera (Yao et al., 2009; Rajaram et al., 2019). Furthermore, several studies have recommended combining multiple loci, including trnH-psbA, rpoB, rpoC1, and matK, for enhanced identification (Chase et al., 2007; Gigot et al., 2007; Hollingsworth et al., 2009; Singh et al., 2012; Siripiyasing, 2012).
Next generation sequencing (NGS) – a new technique for long sequence
Next generation sequencing (NGS) enables the sequencing of an organism's entire genome in just one day, which distinguishes it from traditional methods that target short sequences. Also referred to as massively parallel or deep sequencing, NGS has transformed genome research and the use of genomic data.
Recent advancements in NGS technology have significantly reduced the cost and time required for genome assembly, enabling the creation of a reliable, auto-correcting genome from long reads of fragmented DNA. By using bioinformatics tools to analyze and assemble the data, researchers can reassemble genomes efficiently and at low cost. Over the past decade, these technical improvements have expanded genome research beyond key organisms like rice, maize, and cotton to a broad range of plant and animal species.
The NGS sequencing process involves three key steps: library preparation, sequencing, and data analysis. Initially, total DNA is fragmented into short pieces, and adapters are attached to both ends to ensure compatibility with the sequencer. These adapters contain complementary sequences that allow the cleaved DNA fragments to bind to the flow cell, where they undergo amplification and purification. In the subsequent sequencing step, each DNA fragment is amplified, generating numerous copies for analysis.
The process creates millions of copies of single-stranded DNA. Chemically modified nucleotides then attach to the DNA template strand through natural complementarity. Each nucleotide carries a fluorescent tag and a reversible terminator that blocks the addition of the next base. The fluorescent signal reveals which nucleotide was added, and once the terminator is cleaved, the next base can bind. The method includes paired-end sequencing, in which both forward and reverse DNA strands are read. After sequencing, base calling identifies the nucleotides and assesses the reliability of each call.
In this study, we also sequenced the complete chloroplast genome of one Paphiopedilum species to compare the efficiency of species identification between short sequences and the whole genome.
Common bioinformatics analyses in identification techniques using DNA
The bioinformatic procedure for species identification using sequences involves two fundamental steps. First, sequence alignment is performed in one of two ways: novel sequences are aligned pairwise with existing sequences in databases (similarity search), or the studied sequences are aligned against each other within a specific taxonomic group (many-against-each-other search). Second, similarity or variation is compared on the basis of the alignment to identify the target species, using parameters such as GC% content for evaluation.
Various studies have used genetic distances, variable sites, indel occurrences, and monophyletic clusters to identify the distinct characteristics of the examined sequences. These measurements differ with the alignment and discrimination methods employed in each study (Table 1.1). Ultimately, the analysis aims to determine whether the species in question are distinct. Some studies have calculated species resolution as the ratio of the number of identified species to the total number of examined species.
Table 1.1 List of common sequence-based identification methods and identification criteria
Alignment method | Species identification method | Identification criteria
Similarity search | BLAST | Correct identification; Ambiguous identification; Incorrect identification
Many-against-each-other search | Nearest distance | Correct identification; Ambiguous identification; Incorrect identification
Many-against-each-other search | Best match | Correct (match); Ambiguous; Incorrect (mismatch)
Many-against-each-other search | Best close match | Correct (match); Ambiguous; Incorrect (mismatch); No match (under threshold, %)
Many-against-each-other search | All species barcodes | Correct (match); Ambiguous; Incorrect (mismatch)
Many-against-each-other search | Barcoding gap | Intra-specific genetic distance; Inter-specific genetic distance; Barcoding gap
Many-against-each-other search | Tree-based |
Many-against-each-other search | Character-based | Variable sites; Indel; Sequence length; GC% content
The BLAST (Basic Local Alignment Search Tool) method treats each sequence as a query and aligns it pairwise with database sequences to find the most similar sequences, known as best hits, in databases such as GenBank and BOLD. Correct identification occurs when the best hit matches the expected species name.
"Ambiguous identification" occurs when the best hits come from multiple species, including the target species, while "incorrect identification" refers to cases where the best hit does not match the expected species (Feng et al., 2015; Ghorbani et al., 2017a; Parveen et al., 2017; Rajaram et al., 2019). The reliability of BLAST depends on the target species being represented in the existing databases; if its sequence is absent, the identification result can be inaccurate.
Table 1.2 Query identification criteria in sequence-based methods
Method | Successful identification | Ambiguous/Unidentified | Non-successful identification
BLAST | Correct identification: the best BLAST hit of the query sequence is from the expected species | Ambiguous identification: the best BLAST hits of the query sequence are from several species, including the expected species | Incorrect identification: the best BLAST hit of the query sequence is not from the expected species
Nearest distance | Correct identification: the hit in our database with the smallest genetic distance is from the same species as the query | Ambiguous identification: several hits from our database have the same smallest genetic distance to the query sequence | Incorrect identification: the hit with the smallest genetic distance is not from the expected species
Best match | Correct (match): all sequences with the smallest distance to the query are conspecific | Ambiguous: the sequences with the smallest distance to the query are a mixture of allo- and conspecific sequences | Incorrect (mismatch): all sequences with the smallest distance to the query are allospecific
Best close match | Correct (match): all sequences with the smallest distance to the query are conspecific and within the 95th percentile of all intraspecific distances | Ambiguous: the sequences with the smallest distance to the query are a mixture of allo- and conspecific sequences and within the 95th percentile of all intraspecific distances; No match (under threshold, %): no match within the 95th percentile of all intraspecific distances | Incorrect (mismatch): all sequences with the smallest distance to the query are allospecific and within the 95th percentile of all intraspecific distances
All species barcodes | As in "best match", but with all conspecific sequences topping the list of best matches | As in "best match", but with at least one allospecific sequence being more similar to the query than the least similar conspecific sequence | Query with a conspecific sequence in the data set is more similar to all sequences from another species
Barcoding gap | Gap existence: the smallest inter-specific distance is greater than the largest intra-specific distance (coalescent) | | Overlap: the smallest inter-specific distance is smaller than the largest intra-specific distance
Tree-based | Query clusters with all conspecific sequences | Query without a conspecific sequence in the data set ("singleton") | Query with conspecific sequences in multiple clusters
Tree-based | Reciprocal monophyly: members of each species share a unique common ancestor | Paraphyly: members of the species are monophyletic, but nest within another recognized species | Polyphyly: members of the species nest within other species branches
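The BLAST criteria in Table 1.2 reduce to a simple rule on the set of species that share the top score. The following is a minimal Python sketch of that rule, not the pipeline used in this study; the function name, the (species, score) input format, and the extra "no hit" outcome are assumptions made for illustration.

```python
def classify_best_hits(hits, expected_species):
    """Classify a query from its hit list.

    hits: list of (species_name, score) pairs, higher score = better hit
          (hypothetical format, e.g. parsed from a tabular BLAST report).
    """
    if not hits:
        return "no hit"  # assumption: not one of the three criteria in the text
    top_score = max(score for _, score in hits)
    # All species tied for the best score count as "best hits".
    best_species = {species for species, score in hits if score == top_score}
    if best_species == {expected_species}:
        return "correct"
    if expected_species in best_species:
        return "ambiguous"
    return "incorrect"


example_hits = [("P. delenatii", 99.8), ("P. vietnamense", 99.8), ("P. callosum", 95.0)]
print(classify_best_hits(example_hits, "P. delenatii"))  # ambiguous
```

Ties are resolved on exact score equality here; a real analysis might instead compare identity or e-value with a tolerance.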
Genetic distance-based methods use best-hit feedback akin to BLAST to evaluate the Correct, Ambiguous, and Incorrect criteria. These methods rely on the smallest genetic distance, referred to as "Nearest Distance" (Feng et al., 2015), or on the highest similarity, known as "Best match" or "Best close match" (Meier et al., 2006; Xiang et al., 2011).
In their studies, Xu et al. (2015) and Guo et al. (2016) compared each query against all other sequences in the dataset. "Best close match" differs from "Best match" in that it applies a similarity threshold, such as the 95th percentile of all intraspecific distances, to decide how similar a barcode must be to count as a match. Any result falling outside this threshold is excluded from the identification process.
To estimate the threshold distance for a given data set, one analyzes the frequency distribution of intraspecific pairwise distances and finds the distance below which 95% of these distances fall. For instance, if 95% of conspecific sequences have pairwise distances under 1%, a query can only be identified if it matches a barcode in the 0% to 1% range. Any query lacking such a match remains unidentified. However, a significant limitation of this method is that the data set may not accurately represent the taxon being studied.
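The threshold estimate and the resulting decision can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation of Meier et al. (2006): the function names and the (species, distance) input format are hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math


def intraspecific_threshold(distances, percent=95):
    """Distance below which `percent` % of intraspecific distances fall (nearest rank)."""
    ordered = sorted(distances)
    if not ordered:
        raise ValueError("no intraspecific distances given")
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


def best_close_match(query_species, ref_distances, threshold):
    """ref_distances: list of (species, distance_to_query) pairs (hypothetical format)."""
    nearest = min(dist for _, dist in ref_distances)
    if nearest > threshold:
        return "no match"  # closest barcode lies outside the threshold
    nearest_species = {sp for sp, dist in ref_distances if dist == nearest}
    if nearest_species == {query_species}:
        return "correct"
    if query_species in nearest_species:
        return "ambiguous"
    return "incorrect"


# Toy data: 20 intraspecific distances, 95% of which are <= 0.01 (i.e. 1%).
intra = [0.0] * 10 + [0.002] * 8 + [0.01, 0.03]
limit = intraspecific_threshold(intra)
print(limit)  # 0.01
print(best_close_match("A", [("A", 0.004), ("B", 0.02)], limit))  # correct
print(best_close_match("A", [("B", 0.05), ("A", 0.06)], limit))   # no match
```

Dropping the threshold test turns the second function into the plain "Best match" rule.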
In the "All species barcodes" method, sequences are ranked by similarity to the query using the same criteria as the "Best close match" approach. A species is identified if the query is accompanied by all conspecific barcodes from at least two taxa. Conversely, if only one or a few conspecific barcodes are present, the identification remains ambiguous. Additionally, if the query lacks conspecific barcodes and is accompanied by sequences from other species, misidentification may occur (Meier et al., 2006).
The barcoding gap method differs from distance-based approaches in that it analyzes the divergence between the intra-specific and inter-specific genetic distances of a query against the other conspecific and hetero-specific sequences in the dataset. Studies (Yao et al., 2009; Gao et al., 2010; Ginibun et al., 2010; Singh et al., 2012; Siripiyasing, 2012; Tanee et al., 2012; Feng et al., 2015; Chattopadhyay et al., 2017; Parveen et al., 2017; Rajaram et al., 2019) indicate that accurately assessing the barcoding gap is crucial, as using the mean inter-specific distance rather than the smallest can give imprecise results. A clear barcoding gap suggests successful species identification.
Figure 1.4 Diagrams express the relationship between intra-specific genetic distance and inter-specific genetic distance (Meyer and Paulay, 2005)
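As a hedged illustration of the barcoding gap test, the sketch below takes labelled pairwise distances and compares the largest intra-specific distance with the smallest (not the mean) inter-specific distance. The function name and the (species_a, species_b, distance) input format are assumptions for this example.

```python
def barcoding_gap(pairwise):
    """pairwise: list of (species_a, species_b, distance) for sequence pairs.

    Returns (max_intra, min_inter, gap_exists). A gap exists when the smallest
    inter-specific distance exceeds the largest intra-specific distance.
    """
    intra = [d for a, b, d in pairwise if a == b]
    inter = [d for a, b, d in pairwise if a != b]
    if not intra or not inter:
        raise ValueError("need both intra- and inter-specific pairs")
    max_intra, min_inter = max(intra), min(inter)
    return max_intra, min_inter, min_inter > max_intra


pairs = [
    ("A", "A", 0.001), ("A", "A", 0.004),
    ("B", "B", 0.002),
    ("A", "B", 0.020), ("A", "B", 0.015),
]
print(barcoding_gap(pairs))  # (0.004, 0.015, True)
```

Replacing `min(inter)` with the mean inter-specific distance would overstate the gap, which is the imprecision the text warns about.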
The identification strategy aims to ensure that all conspecific sequences group together, distinct from the barcodes of other species. A commonly used approach for this classification is the tree-based method, which is both simple and visual (Asahina et al., 2010; Xiang et al., 2011; Wu et al., 2012; Trung et al., 2013; Kim et al., 2014; Givnish et al., 2015; Tang et al., 2015; Xu et al., 2015; Guo et al., 2016; Poovitha et al., 2016; Chattopadhyay et al., 2017; Ghorbani et al., 2017a; Parveen et al., 2017; Tran et al., 2018; Rajaram et al., 2019). This method separates two species when their sequences cluster into different branches, demonstrating a monophyletic relationship (Figure 1.5) (Meyer and Paulay, 2005).
In certain situations, nucleotide polymorphism measures, including variable sites, sequence length variation, indel information, and GC% content, are used as identification techniques (Shaw et al., 2007; Guo et al., 2012; Tanee et al., 2012; Kim et al., 2015b). This approach, known as the character-based method, treats each nucleotide as an additional character alongside the four traditional nucleotides: A, T, C, and G.
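Two of these character-based measures, GC% content and variable or indel sites, are easy to compute from an alignment. The sketch below is a minimal illustration, assuming sequences are already aligned to equal length with "-" marking gaps; the function names are hypothetical and do not correspond to a specific tool.

```python
def gc_percent(sequence):
    """GC% content of a sequence, ignoring gaps and ambiguous symbols."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def variable_and_indel_sites(alignment):
    """Return (variable_columns, indel_columns) as 0-based column indices.

    A column is variable when its non-gap bases differ; it is an indel
    column when at least one sequence has a gap there.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    variable, indel = [], []
    for col, column in enumerate(zip(*(seq.upper() for seq in alignment))):
        bases = {b for b in column if b != "-"}
        if len(bases) > 1:
            variable.append(col)
        if "-" in column:
            indel.append(col)
    return variable, indel


aligned = ["ATGCA", "ATGTA", "AT-CA"]
print(gc_percent("ATGC"))                 # 50.0
print(variable_and_indel_sites(aligned))  # ([3], [2])
```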
When employing these methods, it is crucial to be aware of potential issues to prevent errors. Research indicates that small sample sizes can lead to inaccuracies (Gigot et al., 2007; Asahina et al., 2010; Parveen et al., 2012; Siripiyasing, 2012; Wu et al., 2012; Yukawa et al., 2013; Kim et al., 2015b; Rajaram et al., 2019), while a broad range of taxonomic samples with lower confidence levels (Lahaye et al., 2008; Chen et al., 2010; Chattopadhyay et al., 2017) may lead to overfitting (Meyer and Paulay, 2005).
Figure 1.5 Phylogenetic clusters are used to distinguish species (Meyer and Paulay, 2005)
This study focuses on developing DNA barcodes for the genetic identification and characterization of Paphiopedilum species in Vietnam through DNA sequencing. The research aims to identify specific markers for accurately detecting orchid germplasm and assessing genetic diversity, using tree-based methods and analyzing nucleotide polymorphism characteristics.
Research on sequence identification of Orchidaceae species
Numerous studies on plant barcoding have been conducted, but current sequencing techniques, particularly next-generation sequencing (NGS), are largely restricted to agricultural crops because of their high cost and time requirements (Ray and Satya, 2014; Thottathil et al., 2016; Li et al., 2018; Zhou et al., 2018a). As a result, shorter sequences remain a practical and efficient option for quick and accurate species identification.
DNA barcoding has been used effectively across various taxa of the Orchidaceae family. A study conducted this year recommended using combinations of two or three loci, specifically matK, rpoC1, rpoB, and trnH-psbA, for 11 Mesoamerican orchid species. The combination of matK, rpoC1, and rpoB achieved a 100% resolution rate, successfully discriminating all 11 species. The combinations of matK with rpoC1 and trnH-psbA, and of matK with rpoB and trnH-psbA, reached a 90.9% resolution rate, identifying 10 out of 11 species, and the single locus matK also achieved a 90.9% success rate.
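The resolution rates quoted above follow directly from the ratio of identified to examined species. A small Python sketch of that arithmetic is given below; the function name is an assumption for illustration.

```python
def species_resolution(identified, examined):
    """Percentage of examined species that were successfully identified."""
    if examined <= 0:
        raise ValueError("examined must be positive")
    if not 0 <= identified <= examined:
        raise ValueError("identified must lie between 0 and examined")
    return 100.0 * identified / examined


print(f"{species_resolution(11, 11):.1f}%")  # 100.0%
print(f"{species_resolution(10, 11):.1f}%")  # 90.9%
```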
MATERIALS AND METHODS
DNA extraction and amplification and sequencing for short sequences
Total DNA from fresh leaves was extracted using Isolate II Plant DNA kit BIO-
(TBR Company, Ho Chi Minh City, Vietnam). DNA stored in TE solution at -20 °C was used as the template for amplification, at 100 ng per 50 µL reaction volume.
PCR reaction components included 10 µL of Taq DNA poly 2X premix (0.1 U/µL Taq DNA Polymerase, 0.4 mM each of dATP, dGTP, dCTP, and dTTP, 4 mM MgSO4, 20 mM KCl, 16 mM (NH4)2SO4, and 20 mM Tris-HCl at pH 8), 1 µL of forward primer, 1 µL of reverse primer, and 1 µL of template DNA at a concentration of 50 ng/µL, with distilled water added to a total reaction volume of 25 µL.
The thermal cycling protocol consisted of an initial denaturation step at 94 °C for 10 minutes, followed by 30 cycles of 30 seconds of denaturation at 94 °C, 30 seconds of annealing at a temperature (Ta °C) specific to each primer pair, and 40 seconds of extension at 72 °C. This was concluded with a final elongation step of 5 minutes at 72 °C. Reactions were run on a SimpliAmp™ Thermal Cycler A24811 (Thermo Fisher Scientific). The specific annealing temperatures and primer details for amplifying the ITS, matK, and trnL regions are provided in Table 2.2.
Table 2.2 Primers used for amplification reactions in the study
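Locus | Ta (°C) | Primer | Sequence | Product length (bp) | Reference
 | | IT2–R | GTAAGTTTCTTCTCCTCC | 900 | (Tsai, 2011)
trnH-psbA | 53 | psbA3’f | CGCGCATGGTGGTTCACAATCC | | (Hollingsworth et al., 2009)
 | | trnHf | GTTATGCATGAACGTAATGCTC | |
rpoB | 53 | 2F | ATGCAACGTCAAGCAGTTCC | |
 | | R1326–mo | TCTAGCACACGAAAGTCGA | |
trnL | 62 | trnL–F | GGTAGAGCTACGACTTGATT | 600 |
 | | trnL–R | CGGTATTGACATGTAAAATGGGACT | |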
The quality of the PCR products was assessed by electrophoresis on a 1% agarose gel, checking for a single distinct, clear band. Unpurified PCR products (40 µL) showing a single bright, thick band were sent to 1st BASE (Singapore) for Sanger sequencing in both forward and reverse directions. The primers used for sequencing were the same as those in the PCR reactions (Table 2.2).
Sequence adjustment and alignment
Ambiguous ends of the raw sequences were trimmed using FinchTV (Geospiza, 2004). The reliability and accuracy of the raw data were ensured by comparing forward and reverse sequences before the consensus DNA sequence was created using Seaview 4.0 (Gouy et al., 2009). All consensus sequences were then submitted to GenBank; the accession numbers are listed in Appendix A1. The alignments were initially processed automatically with Seaview (Gouy et al., 2009) and subsequently refined manually, focusing on the highly divergent non-coding regions of the chloroplast genome and the coding regions of the nuclear genome, which contain numerous indel fragments.
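The forward/reverse comparison behind a consensus sequence can be illustrated with a simplified Python sketch. It is not how FinchTV or Seaview work internally: it assumes the two trimmed reads cover exactly the same region with no indels, and it marks any disagreement as "N" for manual inspection. All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def simple_consensus(forward_read, reverse_read):
    """Compare a forward read with the reverse complement of the reverse read.

    Agreeing positions are kept; disagreements become 'N'.
    Assumes both reads span the same region (no alignment step is performed).
    """
    forward = forward_read.upper()
    reverse = reverse_complement(reverse_read)
    if len(forward) != len(reverse):
        raise ValueError("reads must cover the same region in this sketch")
    return "".join(f if f == r else "N" for f, r in zip(forward, reverse))


print(reverse_complement("ATGC"))            # GCAT
print(simple_consensus("ATGCA", "TGCAT"))    # ATGCA
print(simple_consensus("ATGCA", "TGCAA"))    # NTGCA
```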
Sequence analysis
Sequences from the study were checked for accuracy by comparing them with reference sequences from the GenBank database using BLAST (Basic Local Alignment Search Tool).
Using MEGA7 software (Kumar et al., 2008), we calculated genetic characteristics such as conserved regions, polymorphic sites, and specific genetic distances. We manually recorded nucleotide substitutions, including variation, parsimony, singleton, mono-indel, insertion, and deletion parameters. Additionally, unique variable-site characters, or species-specific SNPs (Single Nucleotide Polymorphisms), play a crucial role in differentiating one species from another.
Trees were constructed with several phylogenetic methods: the neighbor-joining (NJ) method in MEGA7, Maximum Likelihood (ML) and Maximum Parsimony (MP) in PAUP* 4.0, and Bayesian Inference (BA) in the MRBAYES program. Trees were rooted by the outgroup method, and the nucleotide substitution model for each phylogenetic analysis was determined using the jModeltest program. The optimal models identified for ITS, matK, trnL, and matK + ITS were K80 + G, TIM1 + I, TPM1uf + I, and TPM1uf + G, respectively. The proposed model for each DNA locus was used in the PAUP* and MRBAYES analyses. For the MEGA analysis, the Kimura-2-parameter model was employed, and bootstrap resampling with 1000 iterations was conducted to estimate reliability. The resulting phylogenetic tree topology was visualized using Figtree v1.4.3 (Rambaut, 2009).
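For reference, the Kimura 2-parameter (K2P) distance between two aligned sequences is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitional and transversional differences. The Python sketch below illustrates the formula only; it is not MEGA's implementation, and skipping sites with gaps or ambiguous bases pair by pair is an assumption of this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumed handling)
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / sites, transversions / sites
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    distance = -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
    return distance if distance > 0 else 0.0


# One transition (A->G) in ten sites: P = 0.1, Q = 0, d = -0.5 * ln(0.8)
print(round(k2p_distance("AAAAACCCCC", "GAAAACCCCC"), 4))  # 0.1116
```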
Evaluation of species resolution
Species resolution was estimated mainly with the tree-based method, in combination with the polymorphism character-based method.
Firstly, a species represented by only one accession was considered distinguishable if the sequence of that accession was unique and formed a distinct monophyletic branch in the phylogenetic tree (Hollingsworth et al., 2009). When multiple accessions were collected from a single species, the species was considered resolved if they grouped together in one monophyletic branch. Conversely, if conspecific individuals fell in separate paraphyletic branches, species identification failed. Additionally, insertions, deletions, and repeats were analyzed in detail, as these differences can aid in authenticating the accessions (Shaw et al., 2007; Wu et al., 2013). When different species grouped within the same branch of the phylogenetic tree because K2P distance could not discriminate them, the sequences of these accessions were identical in terms of substitutions. Furthermore, indel information was taken into account for the hetero-specific sequences that fell within the same branch.
New primer designing
Primers for five loci (ACO, LEAFY, atpB-rbcL, matK, trnL) specific for Vietnamese Paphiopedilum were developed in two key steps. Initially, published articles were searched for primers to amplify three specific loci. These primer sequences were then aligned with reference sequences from GenBank Paphiopedilum accessions relevant to the study. Using Seaview 4.0, identical nucleotide sites between the primers and alignment sequences were identified, leading to the exclusion of primers that did not match or showed significant discrepancies with Paphiopedilum sequences. Primers with minor differences of 0-2 nucleotides were retained for further testing with Primer3Plus, which aimed to optimize the annealing temperature (Ta) between forward and reverse primers while minimizing issues related to hairpin structures, repeats, and dimers.
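The retention rule of 0-2 nucleotide differences can be sketched in Python as follows. The sketch assumes that the binding site of each reference sequence has already been extracted from the alignment in the same orientation as the primer (for a reverse primer, the reverse complement of the site); the function names and the example sequences are hypothetical and do not come from the study.

```python
def count_mismatches(primer, site):
    """Number of positions at which the primer differs from a binding site."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return sum(1 for p, s in zip(primer.upper(), site.upper()) if p != s)


def screen_primer(primer, reference_sites, max_mismatches=2):
    """Return (keep, worst), where `worst` is the largest mismatch count
    against any reference site and `keep` is True when worst <= max_mismatches.
    """
    worst = max(count_mismatches(primer, site) for site in reference_sites)
    return worst <= max_mismatches, worst


# Hypothetical primer and reference binding sites, for illustration only.
primer = "ACGTACGTAC"
sites_ok = ["ACGTACGTAC", "ACGTACGTAT", "ACCTACGTAT"]
sites_bad = sites_ok + ["TTGTACGTAT"]
```

A primer that passes this screen would still need the Primer3Plus checks for annealing temperature, hairpins, repeats, and dimers described above.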
If suitable primers for Paphiopedilum were unavailable, new primers were designed using the Primer3Plus program. A representative sequence from each region served as a template for primer design. The proposed primers were then analyzed in Seaview software to assess their alignment, conserved characteristics, and the length of the amplified products.
All the selected or designed primers were tested for amplification on the 54 samples in the study. The PCR products were checked by electrophoresis on a 1% agarose gel.
Vegetative morphological identification
Photographs and measurements of the studied samples were recorded for both intra- and inter-specific morphological analysis, focusing on vegetative characters, particularly leaf morphology. Key criteria included blade shape (oblong, elliptic), size, tip, midrib, vein structure, leaf margin, base, color (upperside and underside), surface texture (rough-glossy/smooth), cilia, thickness, toughness, and direction. Each characteristic was meticulously observed and documented, referencing the monograph on Paphiopedilum (Tran, 1998; Averyanov et al., 2004) for accuracy. The species were described comparatively, highlighting the most notable differences first. Observed variables, encompassing qualitative features such as leaf shape, color, and thickness, as well as quantitative leaf size measurements, were organized using Microsoft Excel 2010 for further analysis.
Methods for plastomic analyses
2.3.7.1 DNA extraction for Next generation sequencing (NGS)
The thesis focused on analyzing short sequences for target identification. To obtain these mini-barcodes, moderate-quality extracted DNA was used for amplification and Sanger sequencing. Additionally, the efficiency of a super barcode was tested against the short barcodes using voucher sample DEL_2.
The Paphiopedilum delenatii species underwent whole genome analysis through Next Generation Sequencing, necessitating high-quality total DNA. However, the use of commercial extraction kits yielded an inadequate quantity of DNA for quality testing, library construction, and sequencing. Consequently, a manual DNA extraction method was implemented to obtain the required DNA quality and quantity.
A 0.2 g leaf sample was ground with 5 µL proteinase K and 3 mL of a mixture of β-mercaptoethanol and extraction buffer at 65 °C, then incubated for 30 minutes at 65 °C. The sample was mixed with 600 µL phenol:chloroform:isoamyl alcohol (P:C:I) and centrifuged for 10 minutes at 10000 rpm. After 5 µL of RNase was added and the sample was incubated at 37 °C, 600 µL chloroform:isoamyl alcohol (C:I) was added. DNA was precipitated with isopropanol and incubated overnight at -20 °C. The pellet, obtained by centrifugation, was washed with 70%, 80%, and 90% ethanol. The DNA was resuspended in 25 µL TE and stored at -20 °C.
DNA for NGS must be highly pure, with an OD260/280 value between 1.8 and 2.2, no RNA contamination, and minimal DNA degradation. The DNA concentration must exceed 20 ng/µL, with a sample quantity of at least 300 ng and a minimum DNA volume of 10 µL in EB buffer. Purity is assessed using a NanoDrop 2000, while DNA integrity and concentration are evaluated through electrophoresis on a 0.8% agarose gel in 0.5X TBE solution, observed under fluorescent light. High-quality DNA bands appear bright, thick, and above 10 kb, indicating a high concentration with less degradation. Additional concentration measurements are performed using the NanoDrop 2000 and the Quantus E6150 from Promega.
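The numerical acceptance criteria above can be collected into a small Python check. This is a sketch for illustration, not part of the original workflow: the function name `passes_ngs_qc` and its parameter names are assumptions, and the gel-based criteria (band size, absence of RNA) are not covered because they are assessed visually.

```python
def passes_ngs_qc(od260_280, conc_ng_per_ul, volume_ul):
    """Check a DNA sample against the numerical NGS submission thresholds.

    Returns (ok, checks), where `checks` maps each criterion to True/False.
    """
    total_ng = conc_ng_per_ul * volume_ul
    checks = {
        "purity": 1.8 <= od260_280 <= 2.2,    # OD260/280 between 1.8 and 2.2
        "concentration": conc_ng_per_ul > 20,  # must exceed 20 ng/µL
        "total_mass": total_ng >= 300,         # at least 300 ng in total
        "volume": volume_ul >= 10,             # at least 10 µL
    }
    return all(checks.values()), checks
```

For example, a sample at 25 ng/µL in 10 µL meets the concentration and volume thresholds but contains only 250 ng in total, so it would fail the total-mass criterion.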
GENEWIZ in South Plainfield, NJ, USA, conducted the library construction and whole-genome sequencing of P. delenatii. The sequencing utilized an Illumina HiSeq platform with a 2x150 paired-end (PE) configuration.
2.3.7.2 Read data processing and chloroplast genome assembly
Demultiplexing was conducted using bcl2fastq 2.17. The raw data underwent filtering, which involved discarding paired-end reads that contained adapters, had over 10% N bases in either read, or exhibited a low-quality base ratio (Q