

RESEARCH Open Access

Genomic insights into divergence and dual domestication of cultivated allotetraploid cottons

Lei Fang1†, Hao Gong2†, Yan Hu1†, Chunxiao Liu1†, Baoliang Zhou1, Tao Huang2, Yangkun Wang1, Shuqi Chen1, David D Fang3, Xiongming Du4, Hong Chen5, Jiedan Chen1, Sen Wang1, Qiong Wang1, Qun Wan1, Bingliang Liu1, Mengqiao Pan1, Lijing Chang1, Huaitong Wu1, Gaofu Mei1, Dan Xiang1, Xinghe Li1, Caiping Cai1, Xiefei Zhu1, Z Jeffrey Chen1,6, Bin Han2, Xiaoya Chen7, Wangzhen Guo1, Tianzhen Zhang1,8* and Xuehui Huang2,9*

Abstract

Background: Cotton has been cultivated and used to make fabrics for at least 7000 years. Two allotetraploid species of great commercial importance, Gossypium hirsutum and Gossypium barbadense, were domesticated after polyploidization and are cultivated worldwide. Although the overall genetic diversity between these two cultivated species has been studied with limited accessions, their population structure and genetic variations remain largely unknown.

Results: We resequence the genomes of 147 cotton accessions, including diverse wild relatives, landraces, and modern cultivars, and construct a comprehensive variation map to provide genomic insights into the divergence and dual domestication of these two important cultivated tetraploid cotton species. Phylogenetic analysis shows two divergent groups for G. hirsutum and G. barbadense, suggesting a dual domestication process in tetraploid cottons. In spite of the strong genetic divergence, a small number of interspecific reciprocal introgression events are found between these species, and the introgression pattern is significantly biased towards gene flow from G. hirsutum into G. barbadense. We identify selective sweeps, some of which are associated with relatively highly expressed genes for fiber development and seed germination.

Conclusions: We report a comprehensive analysis of the evolution and domestication history of allotetraploid cottons based on the whole genomic variation between G. hirsutum and G. barbadense and between wild accessions and modern cultivars. These results provide genomic bases for improving cotton production and for further evolution analysis of polyploid crops.

Keywords: Allotetraploid cottons, Resequencing, Divergence, Domestication

Background

Cotton (Gossypium spp.) is the most important natural fiber and edible oil crop in the world. The genus Gossypium includes around 45 diploid (2n = 2x = 26) and five allotetraploid (2n = 4x = 52) species. The allotetraploids, which were present 1–1.5 million years ago (MYA), originated from one hybridization event between an extant progenitor of Gossypium herbaceum (A1) or Gossypium arboreum (A2) and another progenitor, Gossypium raimondii Ulbrich (D5) [1–3]. Gossypium wild relatives grew primarily as perennial upright shrubs or small trees and existed in various stages of domestication as feral derivatives that had established self-perpetuating populations in human-modified environments such as roadsides, field edges, and dooryards [4]. Cotton is a unique example of crop domestication that occurred in two Old World diploids, G. herbaceum L. and G. arboreum L., and two New World allotetraploids, Gossypium hirsutum and Gossypium barbadense, in four different pre-historical cultures [4]. Under long-term human selection of a wide range of morphological and physiological traits, the two tetraploid species, G. hirsutum and G. barbadense, have been domesticated and cultivated. However, photoperiod sensitivity in long-lived perennial species with a slow rate of plant development and seed emergence and the broad spectrum of fruiting habits in cultivars have been under-investigated [5–7].

* Correspondence: cotton@njau.edu.cn; xhhuang@shnu.edu.cn

†Equal contributors

1 State Key Laboratory of Crop Genetics and Germplasm Enhancement, Cotton Hybrid R & D Engineering Center (the Ministry of Education), Nanjing Agricultural University, Nanjing 210095, China

2 National Center for Gene Research, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200233, China

Full list of author information is available at the end of the article

© The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver

Modern G. hirsutum cultivars (Upland cotton) have high-yield properties and account for more than 90% of worldwide cotton production, while G. barbadense, characterized by its extra-long staple (ELS) and strong, fine fibers, accounts for less than 10% [8]. G. hirsutum is native to Mesoamerica and the Caribbean, and G. barbadense is indigenous to the coastal areas of Peru [9, 10]. Through intensive study of germplasm collections, Hutchinson [11] identified one wild and six domesticated races (not botanical varieties) of G. hirsutum based mainly on their morphologies and distinct geographic distributions. Modern Upland cotton has been further improved in the southern United States from domesticated early-cropping perennials through extensive human selection to produce a common set of agronomic features known as "domestication syndrome" traits [12]. These traits include an annual growth habit and photoperiod insensitivity [5], decreased seed dormancy [6], a large boll size and number per plant [1], and superior fiber quality [13]. The genetic diversity of allotetraploid cottons has been studied for decades using pedigree information and morphologies [14, 15], biochemical markers [7, 16], and DNA-based markers [17–20]. Genomic insights into variation within and between allotetraploid cotton species are limited by the lack of known allotetraploid genome sequences. To resolve this, we resequenced and conducted genomic analysis of 147 cotton accessions with different origins after sequencing the genome of the genetic standard Upland cotton line, TM-1 [21]. Until now, only a few candidate genes related to cotton lint yield and fiber quality have been functionally characterized. We therefore integrated expression profiling data, quantitative trait loci (QTL) mapping, and functional annotations with orthologs in Arabidopsis to rapidly identify genes associated with domestication, especially fiber development and seed germination. The present research provides genome-wide insights into the genetic divergence and dual domestication of cultivated tetraploid cottons.

Results and discussion

Genetic diversity

Upland and Sea Island varieties were established in the seaboard colonies of the southeastern United States by the mid-18th century, and the Egyptian cottons in the Nile Delta by the early 19th century. We therefore sampled 147 G. hirsutum and G. barbadense accessions, including wild species, races, landraces, and modern improved cultivars, from different geographic locations, representing the long history of cotton domestication and breeding throughout the world (Table 1; Additional file 1: Figure S1; Additional file 2: Table S1). Close relatives of the allotetraploid cotton species, Gossypium tomentosum (AD)3, Gossypium mustelinum (AD)4, and Gossypium darwinii (AD)5, as well as Thespesia populneoides (Roxb.) Kostelas, which is closely related to the genus Gossypium in the Malvaceae family, were all included as outgroups. We resequenced all 147 accessions with approximately fivefold coverage, generating a total of 1.8 terabases of raw sequence data, and aligned the reads to the reference genome sequence of TM-1 [21] to identify sequence variants (Table 1). We used direct genome sequence comparison and PCR-based sequencing strategies to validate the quality of the called single nucleotide polymorphisms (SNPs). Two recently sequenced accessions in our sequence panel, G. barbadense cv. Xinhai 21 (XH21) [22] and G. hirsutum acc. TM-1 [21], were used as controls. We checked the called SNPs from our sequence panel against the two assembled genome sequences and found the accuracy of SNP calling to be 96.2% for XH21 and 99.1% for TM-1, with a low missing data rate (6.8%). We further randomly selected 68 SNPs for PCR-based sequencing in 11 accessions, each randomly selected from one cluster of the phylogenetic tree constructed with the 147 accessions, and found that the accuracy was 95.0% (Additional file 1: Figure S2; Additional file 3: Table S2; Additional file 4: Table S3). Therefore, the quality should be reliable enough for follow-up phylogenetic and population genetic analyses.

Table 1 Summary of sequencing of and variations in G. hirsutum and G. barbadense

[Table body not recovered. Surviving row labels: Uniquely mapping rate to the A subgenome; Uniquely mapping rate to the D subgenome]

a Others includes four G. barbadense races, Kaiyuanlihemumian, Yuanmoulihemumian, Alabolihemumian, and Kaiyuanlianhemumian, and close relatives Thespesia

Of the sequenced reads, 36.2% and 23.9% were uniquely mapped to the A and D subgenomes of the TM-1 reference genome (1.9-Gb oriented scaffold), respectively (Table 1). Additionally, 10.5% of the total reads were mapped to the A subgenome unoriented scaffolds and 1.9% to the D subgenome unoriented scaffolds; we did not use these unoriented scaffolds for further analysis. Moreover, 23.4% of the total reads were mapped to no location or to multiple locations, which may be caused by the high proportion of repeated sequence (67.2%) or the highly homoeologous regions between the A and D subgenomes in cotton. Only 4.1% of the total reads were mapped to the unclassified scaffolds, which had little effect on our analysis.
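The read-mapping categories reported above can be tallied as a quick consistency check. The Python sketch below is only an illustration: the dictionary keys are labels chosen here, and the percentages are the values stated in the text.

```python
# Percentages of total reads per mapping category, as stated in the text.
# The key names are illustrative labels, not identifiers from the study.
read_fractions = {
    "A_oriented_unique": 36.2,
    "D_oriented_unique": 23.9,
    "A_unoriented": 10.5,
    "D_unoriented": 1.9,
    "unmapped_or_multi_mapped": 23.4,
    "unclassified_scaffolds": 4.1,
}

# All categories together should account for every read.
total = round(sum(read_fractions.values()), 1)

# Reads uniquely mapped to the oriented A and D subgenome scaffolds.
oriented_unique = round(
    read_fractions["A_oriented_unique"] + read_fractions["D_oriented_unique"], 1
)

print(f"total = {total}%, oriented and unique = {oriented_unique}%")
```

Under the stated figures, the six categories account for all reads, and about three fifths of the reads map uniquely to the oriented A and D scaffolds.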

Overall, we identified 16,377,749 non-unique SNPs, defined as those with the variant occurring in at least two accessions, and 144,662 non-unique indels (1 bp–8 kb; Additional file 5: Table S4). Of these indels, 16,879 longer than 50 bp were identified as structural variants (SVs; Additional file 1: Figure S3; Additional file 6: Table S5). For instance, the SV (2992 bp) identified on chromosome D09 from 44,118,172 to 44,121,164 bp could be detected in 37 accessions. These variants were distributed across all 26 chromosomes, with an average density of 8.5 SNPs per kilobase (Additional file 7: Table S6). The SNP density in the A subgenome (9.2 SNPs per kilobase) was higher than that in the D subgenome (7.4 SNPs per kilobase). By analyzing the allele frequency of each SNP site in the 147 accessions, we identified 7,993,856 common SNPs, each with an allele frequency of >5%, including 3,203,112 intraspecific SNPs in G. hirsutum, 3,770,221 in G. barbadense, and 2,752,128 (~34.4%) nearly fixed interspecific SNPs (SNPs with an allele frequency of >95% in G. hirsutum or G. barbadense and <5% in the other species; Additional file 1: Figure S4).
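The frequency thresholds in this paragraph can be expressed as a small classification rule. The following Python sketch is a minimal illustration under stated assumptions: the function name, argument names, and category labels are hypothetical, and the thresholds (5% and 95%) are the ones given in the text. It is not the variant-calling pipeline used in the study.

```python
def classify_snp(f_all, f_gh, f_gb, low=0.05, high=0.95):
    """Classify one SNP from its variant-allele frequencies.

    f_all: frequency across all accessions (assumed input)
    f_gh:  frequency within G. hirsutum accessions (assumed input)
    f_gb:  frequency within G. barbadense accessions (assumed input)
    """
    if f_all <= low:
        return "rare"
    # Nearly fixed interspecific: >95% in one species and <5% in the other.
    if (f_gh > high and f_gb < low) or (f_gb > high and f_gh < low):
        return "nearly fixed interspecific"
    return "common"


# Share of common SNPs that are nearly fixed between the species (counts from the text).
nearly_fixed_share = 2_752_128 / 7_993_856

print(classify_snp(0.40, 0.98, 0.01))
print(f"{100 * nearly_fixed_share:.1f}% of common SNPs are nearly fixed")
```

The ratio of the two counts reproduces the ~34.4% quoted in the text.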

Dual domestication of cultivated allotetraploid cottons

The whole-genome SNP data were used to investigate the phylogenetic relationships among all allotetraploid cotton collections (Fig. 1a; Additional file 8: Dataset 1). The resulting neighbor-joining (NJ) tree contained two largely divergent clades: the G. hirsutum clade (n = 85) and the G. barbadense clade (n = 52), consistent with a previous study based on a limited number of accessions [23]. Both studies suggested a strong divergence between G. hirsutum and G. barbadense. Model-based analyses of population structure using STRUCTURE revealed two different components, G. barbadense and G. hirsutum, when K (the number of populations modeled) was set to 2. However, when K was set to 3, there were three different components: G. barbadense cultivars, G. hirsutum cultivars, and G. hirsutum races (Fig. 1b). This model-based result, along with that from principal component analysis (Fig. 1c), agreed well with the pattern in the phylogenetic tree. The outgroup type comprised ten accessions in total, including T. populneoides, G. tomentosum (Hawaiian Islands), G. darwinii (Galapagos Islands), and seven tetraploid accessions that might have resulted from genetic introgressions from wild progenitors or from historical interspecific crossing between G. hirsutum and G. barbadense (Additional file 2: Table S1). No clear separation existed between the seven races (33 accessions in total) in G. hirsutum, which was likely due to human-mediated accession expansion bringing formerly isolated races into mixed and overlapping distributions (Fig. 1d). However, one punctatum race from Egypt and one latifolium race from Chiapas were most closely related to G. hirsutum cultivars (Fig. 1; Additional file 8: Dataset 1). Some African and Indian cultivars were classified into one landrace subgroup, which was closely related to the true annual forms of punctatum grown in Africa, further supporting the early cropping of race punctatum in the Old World [11]. Punctatum is a race originally found inland on the Yucatan peninsula. Whether annual forms of punctatum were developed before or after its introduction into Africa remains to be explored. These genomic data revealed at least two origins of upland cotton in the Old World and the New World: punctatum in America or Africa and latifolium in America, consistent with the domestication and improvement history of upland cotton [1, 2, 11, 18].

The origins of modern cultivated G. barbadense are complex and somewhat obscure. Unlike G. hirsutum, which exists in both wild and cultivated states, G. barbadense is found only in cultivars. The present research provides genomic evidence that G. barbadense is indigenous to Peru and Brazil, since the primitive landraces of G. barbadense native to Brazil and Peru, together with West of the Andes and Sea Island cotton, were classified into one subgroup (Fig. 1e). This suggests a probable center of origin in northwestern South America, consistent with archeological records [24]. All modern ELS cultivars were classified into three subgroups: Egyptian, American Pima, and Central Asia cottons.

Genomic divergence between G. hirsutum and G. barbadense

Much of the genetic diversity of cotton can be quantified by the frequency of SNPs. In addition to 322,285 coding-region SNPs (cSNPs) and 173,334 intronic-region SNPs involved in 56,401 predicted genes [21], the majority (93.8%) of the 7,993,856 common SNPs were located in intergenic regions (Additional file 1: Figure S5). The allele frequency distributions of 44,250 nearly fixed cSNPs were highly diverged between G. hirsutum and G. barbadense. The number of nearly fixed cSNPs detected between 33 race accessions and 52 cultivars in G. hirsutum was 1179 (Additional file 1: Figure S6). The sequence divergence at the evolutionary level among accessions was further evaluated using the ratio of nonsynonymous (Ka) SNPs to synonymous (Ks) SNPs. The average Ka/Ks ratio was 0.49 for all common cSNPs. However, for 561 genes with nucleotide-binding site leucine-rich repeat domains, the ratio (0.73) was higher, suggesting that these genes are evolving more rapidly in response to co-evolving pathogens. The Ka/Ks ratios for the nearly fixed cSNPs were 0.57 between G. hirsutum and G. barbadense, and 0.91 between races and modern cultivars of G. hirsutum, indicating the existence of higher selection pressure during upland cotton domestication from wild to dooryard types and then field production.

Fig. 1 Phylogenetic relationships of 147 cotton accessions. a A neighbor-joining tree was constructed using whole-genome SNP data. The cotton samples were divided into G. hirsutum races (orange), G. hirsutum cultivars (green), G. barbadense cultivars (dark blue), and outgroup species (light blue). b Population structure of cotton accessions determined using STRUCTURE. The accessions were divided into three groups when K = 3. c Principal component analysis of all cotton accessions using whole-genome SNP data. d Phylogenetic relationships between G. hirsutum cultivars and races. e Phylogenetic relationships between G. barbadense landraces and cultivars. The scale bar indicates the simple matching distance.

We also identified 5784 protein-coding genes with premature stop codons or frameshifts resulting from 6661 SNPs and 2047 indels. Frameshift mutations resulting from the 2047 indels occurred in a total of 1447 protein-coding genes (Additional file 9: Table S7). Of these, we found a flowering-related gene, Gh_D02G1411, homologous to ABA OVERLY SENSITIVE 4 (AtABO4, AT1G08260) in Arabidopsis. The abo4-1 plants were early flowering, with lower expression of FLOWERING LOCUS C, higher expression of FLOWERING LOCUS T, and changed histone modifications at these two loci [25]. Another interesting indel-containing gene encodes a cell wall-loosening protein, expansin A8 (EXPA8), which plays an important role in determining the rate and temporal period of fiber elongation and in further improving fiber quality [26].

We examined the genetic diversity across the 26 chromosomes (Additional file 10: Table S8), and a strong signal of differentiation was observed at the whole-genome level between G. hirsutum and G. barbadense accessions (Fig. 2, with chromosomes A01 and D01 displayed as examples; Additional file 1: Figure S7). The fixation index values (FST) were 0.63 and 0.65 in the A and D subgenomes, respectively, which were slightly higher than that between indica and japonica rice subspecies (FST = 0.55) [27] and much higher than that between G. hirsutum races and cultivars (FST = 0.10 for both subgenomes).
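To make the fixation index concrete, the sketch below computes FST from per-SNP allele frequencies in two populations. The text does not state which estimator was used, so this example assumes a Hudson-style estimator (ratio of summed numerators to summed denominators). The function name and the toy frequencies are illustrative only.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson-style FST over a set of biallelic SNPs (assumed estimator).

    p1, p2: per-SNP allele frequencies in populations 1 and 2
    n1, n2: number of sampled chromosomes in each population
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different populations differ.
        den += a * (1 - b) + b * (1 - a)
    return num / den if den > 0 else float("nan")


# Toy data: two SNPs fixed for different alleles and one shared polymorphism.
print(hudson_fst([1.0, 1.0, 0.5], [0.0, 0.0, 0.5], n1=20, n2=20))
```

Values near 1 indicate populations that are nearly fixed for different alleles, while values near 0 indicate similar allele frequencies, which is the contrast drawn above between the interspecific comparison and the race versus cultivar comparison.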

Whole-genome analysis identified 109 selective sweeps that spanned 3.4% of the G. hirsutum genome through the comparison of 33 accessions of seven races and 52 modern cultivars (πrace/πcultivar > 25; Fig. 3; Additional file 11: Table S9). We investigated the genomic variation of G. barbadense at the 109 selective sweep regions identified in G. hirsutum. Compared with the sequence diversity at the whole-genome level, the G. barbadense population did not show a significant change at the 109 selective sweeps (πsweep = 0.00055 versus πgenome = 0.00056), indicating different selection pressures on the G. hirsutum and G. barbadense genomes. These genomic data further support our previous view that the two species were domesticated independently [1, 8]. The phenomenon is similar to the dual domestication processes in common beans, where two divergent populations of Phaseolus vulgaris were independently domesticated in Mesoamerica and South America [28], as well as in cultivated rice, where Oryza sativa and Oryza glaberrima were independently domesticated in Asia and Africa [29].
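The diversity-ratio criterion for selective sweeps (πrace/πcultivar > 25) can be sketched as a windowed scan. The Python example below is a simplified illustration: it assumes biallelic SNPs summarized as alternate-allele counts per window, and the function names, window contents, and the decision to skip windows with zero cultivar diversity are assumptions made here, since the text does not specify these details.

```python
def nucleotide_diversity(alt_counts, n_chrom, window_len):
    """Average pairwise differences per base pair in one window.

    alt_counts: alternate-allele count at each SNP in the window
    n_chrom:    number of sampled chromosomes
    window_len: window length in base pairs
    """
    total = 0.0
    for k in alt_counts:
        # Expected pairwise difference at one biallelic site.
        total += 2.0 * k * (n_chrom - k) / (n_chrom * (n_chrom - 1))
    return total / window_len


def sweep_windows(pi_race, pi_cultivar, threshold=25.0):
    """Return indices of windows where pi_race / pi_cultivar exceeds the threshold."""
    hits = []
    for i, (r, c) in enumerate(zip(pi_race, pi_cultivar)):
        # Windows with zero cultivar diversity are skipped here (an assumption).
        if c > 0 and r / c > threshold:
            hits.append(i)
    return hits


# Toy example: only the first window shows a >25-fold loss of diversity.
print(sweep_windows([0.0026, 0.0020], [0.0001, 0.0010]))
```

A window is flagged only when diversity in cultivars has dropped sharply relative to races, which is why unchanged diversity in G. barbadense at the same regions points to a different selection history.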

G. hirsutum and G. barbadense had similar levels of sequence diversity. The nucleotide diversity levels of the A and D subgenomes were 0.00075 and 0.00073, respectively, in G. hirsutum, and 0.00061 and 0.00051, respectively, in G. barbadense. It is possible that these numbers have been underestimated because tetraploid cotton genomes have large proportions of repetitive sequences and paralogs [21], similar to those in other large-genome plants such as maize [30]. To provide an indication of the mapping resolution in genome-wide association studies, the decay rate of linkage disequilibrium (LD) was calculated. The average pairwise correlation coefficient (r²) dropped from 0.6 at 1 kb to 0.3 at 1000 kb in G. hirsutum. This slow LD decay might have resulted from the inbreeding nature of cotton. Moreover, as expected, a slower LD decay rate was found in cultivars than in the wild species and primitive races (Additional file 1: Figure S8).
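The pairwise correlation coefficient used to describe LD decay can be illustrated with a short calculation. The sketch below assumes phased haplotypes coded as 0/1 at two SNPs; treating each accession as a single haplotype is a simplification made here for illustration, and the function name is hypothetical.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a, hap_b: lists of 0/1 alleles at SNP A and SNP B, one entry per haplotype
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(1 for x, y in zip(hap_a, hap_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # undefined when either SNP is monomorphic
    return d * d / denom


print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated SNPs
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # independent SNPs
```

Averaging such values over SNP pairs binned by physical distance gives a decay curve of the kind summarized in the text.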

Asymmetric introgression between G. hirsutum and G. barbadense

In spite of the strong genetic divergence between G. hirsutum and G. barbadense, the interspecific hybrids of the two cultivated species are fertile and grow vigorously, and some F1 hybrids are commercially produced [31]. Cotton breeders have worked diligently to introduce desired alleles from one species into the other in order to increase genetic diversity. To analyze introgression between tetraploid cottons, a recently developed "3-population test" method [32, 33] was used for modeling. Among all possible scenarios, we found evidence of introgression events between G. hirsutum races and G. barbadense cultivars (f3 = −0.1223, Z score = −253.4; Additional file 12: Table S10). These introgression events were successfully traced using the population-scale genomic data generated in the present study (Additional file 1: Figure S9). On average, 0.2% of the genomic regions in 137 accessions (excluding the ten outgroup accessions) showed obvious introgression events (384 introgression events detected in at least two accessions) (Additional file 13: Table S11). Intriguingly, the introgression events were significantly biased towards gene flow from G. hirsutum into G. barbadense rather than from G. barbadense into G. hirsutum (265 versus 119, Fisher's exact test, P = 8.04E-08; Fig. 2; Additional file 14: Dataset 2). Moreover, more introgression events were found in the A subgenome (250) than in the D subgenome (134) (Fisher's exact test, P = 2.29E-05). A previous study described interspecies introgression in a limited population of 11 G. hirsutum and three G. barbadense accessions [23]; however, the researchers used two diploid progenitor genomes [34, 35] instead of the two published tetraploid genomes [21, 36] as the reference. Many structural variations have occurred since the formation of tetraploid cotton compared with the two corresponding progenitors. In our previous colinearity analysis, the overall gene order and colinearity were largely conserved between our A and D subgenomes [21] and the D progenitor genome [34], but this colinearity was not obvious between our A and D subgenomes and the A progenitor genome [35], partly due to numerous mis-assemblies in the A progenitor genome, as we reported before [21], and partly because G. arboreum is an important cultivated diploid species and may have undergone some of its own chromosomal rearrangements during its evolution and improvement. Additionally, the larger population in the present study helps to identify introgression events more comprehensively than the previous study, which used limited samples [23].

Fig. 2 Characterization of the genetic diversity and introgression on chromosomes A01 and D01 in cotton. The levels of genetic diversity in G. hirsutum cultivars (π Gh cultivar) (a) and races (π Gh race) (b), the level of genetic diversity in G. barbadense (π Gb cultivar) (c), and the level of genetic differentiation between G. hirsutum and G. barbadense (d). For introgression analysis, the genetic backgrounds of G. hirsutum cultivars, G. hirsutum races, and G. barbadense cultivars are illustrated in green (a), orange (b), and blue (c), respectively.

Fig. 3 Identification and comparative analysis of the selective sweeps in G. hirsutum. The values of πrace/πcultivar were plotted against the position on each of the 26 chromosomes. The relationships between each selective sweep and its corresponding homologous region in the allotetraploid genome are indicated by grey lines. The 12 selective sweep pairs with high or modest selection signals in homoeologous regions are indicated by red lines. The blue arrow indicates the fiber quality-related QTLs around the strongest selection signal locus in D11 and the longest selection region in A06. The red arrow indicates the POX and ACS1 genes in the A08/D08 and A12/D12 homoeologous regions.
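The statistic behind the 3-population test used for the introgression analysis can be sketched as follows. For a target population C and two candidate source populations A and B, f3(C; A, B) is the average over SNPs of (c − a)(c − b), where a, b, and c are allele frequencies; a significantly negative value suggests that C carries ancestry from both sources. The Python sketch below is a simplified illustration: it omits the finite-sample correction and the block-jackknife Z score used in practice, and the function name and toy frequencies are assumptions, not the configuration of the study.

```python
def f3(target, source1, source2):
    """Naive f3(target; source1, source2) from per-SNP allele frequencies.

    Each argument is a list of allele frequencies, one value per SNP.
    No finite-sample bias correction is applied in this sketch.
    """
    terms = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(terms) / len(terms)


# Toy case 1: the target is intermediate between the two sources (admixture-like).
print(f3([0.5, 0.5], [0.0, 1.0], [1.0, 0.0]))

# Toy case 2: the target lies outside the range of both sources (no admixture signal).
print(f3([0.9, 0.1], [0.5, 0.5], [0.4, 0.6]))
```

In the first toy case the statistic is negative, as expected for an admixed target, while in the second it is positive.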

Across the allotetraploid cotton genomes, we found 11 regions of extensive introgression, with the greatest density on chromosome A1 (Fig. 2; Additional file 14: Dataset 2; Additional file 15: Table S12). Analysis of QTLs has provided genetic evidence that these regions were associated with fiber quality traits (Additional file 16: Table S13). We observed 169 introgression events from six primitive races of G. hirsutum into Sea Island cottons of the G. barbadense species, such as Coastland R4-4, Seabrook, and West of Andes, but not into Tanquis, whose fiber is coarse and of medium staple (23.8 to 27.0 mm in length). This fiber performance of landraces such as Tanquis is typified by current cottons of Peru, where the ancestral G. barbadense originated [9]. Genomic evidence from the present study reveals subsequent introgressions from the local wild G. hirsutum or races into G. barbadense during its movement northward through inland Mesoamerica, from the Yucatan peninsula to the Caribbean Islands, where Sea Island cotton originally formed and from where it was then introduced to the coastal states of the southeastern United States (Additional file 1: Figure S10). No introgression events occurred from richmondi to Sea Island cottons, probably because of its restricted geographical distribution along the Pacific side of the Isthmus of Tehuantepec or because of the limited number of accessions collected.

Among these 169 introgression events from G. hirsutum races into G. barbadense accessions, four events observed in Giza36, Giza80, Pima S-1, and Pima S-2 were detected in the same introgression region, the ChrA10.57.block (Additional file 14: Dataset 2). This block overlaps a QTL for fiber length (qFL-A10-2) [37]. In this block, we annotated 11 genes, of which five were potentially related to seed and fiber development, mainly involved in auxin transport (auxin efflux carrier gene) [38], transcription factors (WD40 repeat-like superfamily genes) [39, 40], and carbohydrate metabolism (o-fucosyltransferase gene, sucrose phosphate synthase gene, and beta galactosidase) [41] (Additional file 17: Table S14). In the ChrA11.88.block, which is also an introgression region from G. hirsutum races into four Central Asia type G. barbadense accessions (CCCP1243, XH 3, XH 11, and XH 29), at least nine of 27 genes are potentially related to disease resistance, including two TIR-NBS-LRR genes [42], five pectin methylesterase inhibitor genes [43], and two dirigent-like protein genes [44] (Additional file 14: Dataset 2; Additional file 17: Table S14). We found 1061 genes in 169 introgression events from G. hirsutum races to G. barbadense and 665 genes in 96 events from G. hirsutum cultivars to G. barbadense. Interestingly, the genes in the former were enriched in developmental processes, such as reproduction, epithelial cell development, and cell proliferation, possibly allowing the allopolyploid to survive and even thrive considering its wide adaptation. In contrast, the latter genes were enriched in cellular homeostasis, fatty acid oxidation, and lipid catabolic processes (Additional file 18: Table S15). In the 119 introgression events from G. barbadense to G. hirsutum, we further found 587 genes enriched in lipid metabolic and carbohydrate metabolic processes (Additional file 18: Table S15). These results support the idea that such introgressions confer beneficial traits such as fiber quality and photoperiod neutrality and are responsible for the creation of the Sea Island cotton germplasm, as reported previously [5, 9, 12, 20, 31].

In spite of a low introgression rate, some G. barbadense segments were found to be introgressed into G. hirsutum races (Additional file 14: Dataset 2). These interspecific gene flows might have occurred during the northward movement of G. barbadense (Additional file 1: Figure S10).

Modern Egyptian-type ELS cultivars showed genomic signatures of G. hirsutum race introgressions in chromosome A1 (81–84 Mb, 88–89 Mb), A10 (21–22 Mb, 56–57 Mb), and D11 (10–11 Mb); the American Pima type in A1 (77–78 Mb, 84–89 Mb) and A10 (56–57 Mb); and the Central Asia type in D1 (42–44 Mb), D9 (3–4 Mb, 5–6 Mb, 49–50 Mb), D10 (6–7 Mb, 57–62 Mb), and D11 (11–16 Mb, 63–64 Mb) (Additional file 13: Table S11; Additional file 14: Dataset 2), suggesting a distinct improvement in the Central Asia type ELS cultivars. Some introgression events, such as those in chromosome A1, were previously reported using restriction fragment length polymorphism markers [20], in which the G. hirsutum allele was found in 48 (94%) of the 51 G. barbadense collections, including Egyptian and Pima cottons. Furthermore, modern breeding has enhanced gene flow and post-domestication introgressions through deliberate hybridization between these two species. For example, targeted introgressions from G. barbadense cultivars have been used to develop Acala cultivars, which improved upland cotton's fiber quality and Verticillium resistance [45].

Signatures of selection and adaptive trait associations in G. hirsutum

The genetic diversity in modern cultivars was found to be low (πcultivar = 0.00074), only 34.2% (32.4 and 35.0% for the A and D subgenomes, respectively) of that in races (πrace = 0.00216), indicating a strong genetic bottleneck during upland cotton domestication. This retained proportion of diversity is close to that in japonica rice (33%) [27] and much lower than that in maize (83%) [46] and indica rice (75%) [27].

Phylogenetic analysis of the 109 selective sweeps revealed a strong selection pressure in nearly all cultivars of G. hirsutum. The average selection signal in the A subgenome (πrace/πcultivar = 32.8) was close to that in the D subgenome (πrace/πcultivar = 35.0), but the sweep regions in the A and D subgenomes were largely different. These selective sweeps, which were selected during domestication for fiber yield and fiber quality, provide a resource for molecular breeding of G. barbadense in the future.

Interestingly, 12 homoeologous pairs of selective sweeps with high or modest selection signals (πrace/πcultivar ranging from 15.4 to 39.6) were detected between the A and D subgenomes (Fig. 3), probably due to selection of a common set of domestication genes. For example, peroxidase genes (POX, Gh_A08G0711/Gh_D08G0829) and ACC synthase genes (ACS1, Gh_A12G0969/Gh_D12G1017) participating in ethylene biosynthesis were co-selected within the overlapped regions of the selective sweeps of the A08/D08 and A12/D12 homoeologous pairs, and these genes play key roles in fiber elongation [47, 48].

To investigate the contribution of selective sweeps to the domestication for fiber yield and fiber quality in G. hirsutum, the overlap between selective sweeps and QTLs of various agronomic traits was further examined. A total of 211 fiber quality- and lint yield-related QTLs were located around 67 selective sweeps (Additional file 19: Table S16). The locus associated with the strongest selection signal (πrace/πcultivar = 100.0) was located on chromosome D11 and overlapped with several QTLs controlling fiber length (Fig. 3; Additional file 19: Table S16). Another strong selective sweep was located on chromosome A6, covering a very long genomic interval (21.6 Mb) that overlapped QTLs for fiber length and lint percentage (Additional file 19: Table S16). Fiber length and lint yield have greatly increased during domestication from wild type, primitive races, and advanced types to modern cultivars.

The examination of gene expression in selective sweeps responsible for various agronomic trait QTLs indicated that some causal genes may be related to this domestication. Of the 1058 genes identified in all 109 selective sweeps, 723 were expressed in fiber development stages. Additionally, 236 of these 723 genes had significantly higher expression levels during fiber development in domesticated cotton (TM-1) than in two wild relatives (TX665, G. hirsutum var. palmeri and TX2094, G. hirsutum var. yucatanense) (Additional file 20: Table S17).

Using RNA-seq data from multiple tissues, we found that the proportions of genes that were expressed during fiber development and seed germination were higher in the selective sweeps than in the whole genome (Additional file 1: Figure S11). Within the selective sweeps, 76 fiber- and 115 seed germination-related genes (Additional file 21: Table S18; Additional file 22: Table S19; Additional file 23: Table S20) were identified based on their expression profiles. Ten of these 76 genes were expressed at significantly higher levels in TM-1 than in the palmeri and yucatanense races (Fig. 4). For instance, a cytokinin oxidase gene (CKX6, Gh_D04G0688) was associated with increased fiber and seed yield [49]; a fatty acid desaturase (FAD3, Gh_07G0946) was required for the specific membrane structure of fiber cells, and genes encoding very long chain fatty acid (VLCFA) synthase were required for fiber cell elongation [47, 48]. These results suggest potential roles for these genes in the improved fiber qualities of domesticated cotton.

Of the 115 seed germination-related genes, Gene Ontology analysis showed an enrichment for genes involved in biological processes related to histone methylation and ethylene signaling pathways, which are required for the positive regulation of seed dormancy [50] (Additional file 23: Table S20). For instance, the gene encoding an AP2/ethylene response factor (ERF1, Gh_D10G1537) was found in the selective sweeps. The loss-of-function ap2 mutant showed increased seed mass relative to the wild type in Arabidopsis [51]. Overexpression of OsERF1 in Arabidopsis up-regulated the expression of two known ethylene-responsive genes, leading to short hypocotyls/roots and the production of fewer seeds or no siliques at all [52]. Another gene, Gh_A10G0771, was homologous to a RING E3 ubiquitin ligase in Arabidopsis, which regulated the stability of the cyclin-dependent kinase inhibitor KRP1 and further negatively regulated the cell number and seed size [53, 54]. KRP1 is the target of the ubiquitin-proteasome pathway recently found to play an important part in plant seed size determination [55, 56]. However, the molecular mechanisms of antagonistic function in the complex regulation of seed dormancy are still unclear. The candidate genes identified in the selective sweeps are valuable for future functional analyses of seed dormancy reduction during domestication.

Conclusions

Resequencing and genome-wide analysis of diverse G. hirsutum and G. barbadense wild accessions and modern cultivars have provided a comprehensive genome-wide assessment of a fiber crop and enabled us to better understand the evolution, diversity, and domestication of allotetraploid cottons. Strong genomic divergence between G. hirsutum and G. barbadense led to dual domestication events of these two cultivated species, while reciprocal, but asymmetric, introgression between them has greatly improved their productivity and fiber quality. Although both are commonly grown as fiber crops, they have been domesticated or improved toward different breeding goals: G. hirsutum for its high yield and wide adaptation, and G. barbadense for its superior fiber quality. This large amount of new genomic resources will substantially improve genetic mapping, gene identification, and molecular breeding in cotton. Specifically, under the guidance of sequence information, the favorable alleles that are associated with high yield potential and wide adaptation in G. hirsutum and with fiber quality in G. barbadense can be introgressed between the gene pools to further improve cotton production.

Methods

Sampling

In order to represent the rich genetic diversity and wide geographical distribution of cotton, we selected seven geographical races of G. hirsutum (“marie-galante”, “punctatum”, “richmond”, “morrilli”, “palmeri”, “latifolium”, and “yucatanense”) [11] and a variety of G. hirsutum cultivars, including four major types (Acala, Delta, Plains, and Eastern) from the USA and other domesticated subtypes from Brazil, India, Africa, and China. Furthermore, G. barbadense cultivars, including American Pima, Egyptian, Peruvian Tanquis, and other subtypes from Russia and China, were also sampled. Although extant wild G. barbadense populations have been reported in Guayas and Los Rios in Ecuador and Tumbes in Peru [31, 57], the search for truly wild accessions is complicated since the wild-to-domesticated continuum in G. barbadense does not have obvious categorical distinctions. Another three wild allotetraploid species, G. darwinii, native to the Galapagos Islands, G. tomentosum from the Hawaiian Islands, and G. mustelinum, an uncommon species restricted to a relatively small region of northeast Brazil, as well as Thespesia populneoides (Roxb.) Kostelas, a species in the mallow family (Malvaceae) closely related to the cotton genus (Gossypium), were chosen to form an outgroup. Detailed information on the 147 cotton accessions is listed in Additional file 2: Table S1.

Fig. 4 Expression pattern of ten fiber-related genes related to cotton fiber quality. a Expression pattern of ten genes related to fiber quality in distinct tissues. b Expression level of these genes in domesticated cotton (TM-1) and two wild relatives (TX665, G. hirsutum var. palmeri; TX2094, G. hirsutum var. yucatanense). Fisher’s exact test, *P value < 0.05 and fold change > 2. c Identification of 109 selective sweeps through comparisons of races and cultivars in G. hirsutum. The values of πrace/πcultivar were plotted against the position on each of the 26 chromosomes. The horizontal line indicates the genome-wide threshold of selection signals (πrace/πcultivar > 25). Asterisks indicate the strongest selection signal locus in D11 and the longest selective sweep in A06. Lines linking b and c indicate the gene loci in selective sweeps. The fiber quality-related QTLs around these gene loci are shown beside these lines.

Library construction and sequencing

For each cotton accession, young leaf tissues from a single plant were collected for genomic DNA extraction using a standard cetyl trimethylammonium bromide (CTAB) protocol [58]. Paired-end sequencing libraries with insert sizes ranging from 300 to 500 bp were constructed according to the manufacturer’s instructions (Illumina, San Diego, CA, USA). All libraries were sequenced on the Illumina HiSeq 2000 platform. A total of 1.8 terabases of genomic sequence data was generated, with an average of 5× genome coverage for each cotton accession.
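As a rough consistency check on the reported depth, the total yield can be divided across accessions and then by genome size. The sketch below assumes a tetraploid cotton genome size of about 2.5 Gb, which is an assumption for illustration and is not stated in this section. The variable names are likewise hypothetical.

```python
TOTAL_BASES = 1.8e12           # 1.8 terabases of sequence, as reported
N_ACCESSIONS = 147             # number of resequenced accessions, as reported
ASSUMED_GENOME_SIZE = 2.5e9    # ASSUMPTION: approximate genome size in bp

# Average sequence yield per accession, in bases.
bases_per_accession = TOTAL_BASES / N_ACCESSIONS

# Average depth = bases sequenced per accession / genome size.
mean_depth = bases_per_accession / ASSUMED_GENOME_SIZE

print(f"{bases_per_accession / 1e9:.1f} Gb per accession, ~{mean_depth:.1f}x depth")
```

Under this assumption, each accession receives roughly 12 Gb of sequence, which is in line with the stated average of about 5× coverage.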

Genotype calling and SNP identification

All sequence reads were aligned against the reference genome sequence (G. hirsutum cv. TM-1) [21] using Smalt software (version 0.57, http://www.sanger.ac.uk/resources/software/smalt/). The parameter for the read mapping was “smalt_x86_64 map -i 700 -j 50 -m 60”. For the oriented 1.9-Gb genome sequence, 36.2% of reads were mapped to the A subgenome and 23.9% to the D subgenome for G. hirsutum. Additionally, 10.5% of the total reads were mapped to the A subgenome scaffold (326.3 Mb); 1.9% of the total reads were mapped to the D subgenome scaffold (61.5 Mb); 4.1% of the total reads were mapped to unclassified scaffold (124.6 Mb); and 23.4% of the total reads had no unique location in the mapping process. Only reads with a unique mapping position in the oriented reference genome and a mapping score higher than 60 were used. If reads had equal matching scores in the A and D subgenomes, the reads were excluded from the SNP calling procedure. The software package Ssaha pileup (http://www.sanger.ac.uk/resources/software/ssaha2/) was used to find candidate SNPs that required support from at least two sequence reads. Only the non-singleton SNPs, defined as those where more than two accessions demonstrate the presence of the alternative alleles, were retained. We then filtered the polymorphic sites by minor allele frequency (5%) and missing rate (10%). We randomly removed polymorphic sites among the high homoeologous polymorphic sites until the remaining polymorphic sites were at least 10 bp away from neighboring polymorphic sites, which gave the final polymorphic datasets. In the filtering process, we found that about 40% of non-singleton polymorphic sites had missing rates of less than 10%. We required that the common SNPs had a minor allele frequency (MAF) greater than 5% and a missing data rate less than 10%. We only analyzed the SNPs that were located in the 26 pseudomolecules of the TM-1 assembly, and the SNPs in the small scaffolds were removed. The SNPs were annotated using the GFF files (the annotation file of all coding regions of each gene) of the TM-1 reference genome sequence. The software KaKs_Calculator was then used to compute the Ka/Ks ratio.
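The site-level filters described above (MAF greater than 5%, missing rate less than 10%, and a minimum spacing of 10 bp) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: each accession is assumed to contribute one call per site coded as 0 (reference), 1 (alternative), or `None` (missing), and spacing is enforced by a deterministic left-to-right pass, whereas the study removed sites at random. All function names are hypothetical.

```python
from typing import List, Optional, Sequence

Call = Optional[int]  # 0 = reference allele, 1 = alternative allele, None = missing


def missing_rate(calls: Sequence[Call]) -> float:
    """Fraction of accessions with no genotype call at this site."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[Call]) -> float:
    """Frequency of the rarer allele among accessions with a call."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / len(observed)
    return min(alt_freq, 1.0 - alt_freq)


def is_common_snp(calls: Sequence[Call], min_maf: float = 0.05,
                  max_missing: float = 0.10) -> bool:
    """Apply the MAF > 5% and missing rate < 10% thresholds."""
    return minor_allele_frequency(calls) > min_maf and missing_rate(calls) < max_missing


def thin_by_distance(positions: Sequence[int], min_gap: int = 10) -> List[int]:
    """Keep sites at least `min_gap` bp apart (greedy simplification of random removal)."""
    kept: List[int] = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)
    return kept


# Hypothetical site: 20 accessions, 2 alternative calls, 1 missing call.
site = [0] * 17 + [1] * 2 + [None]
keep_site = is_common_snp(site)
spaced = thin_by_distance([100, 105, 112, 130])
```

The greedy pass always keeps the leftmost site of a cluster, so it is reproducible but not equivalent to random removal. A randomized version would pick which member of each cluster to drop.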

Indel identification and annotation

Pindel software (version 0.20) [59] was used to identify the indels from the sequence reads. In order to identify the indels, we used the Smalt outputs and kept only three kinds of reads: (1) the paired-end sequence reads that had a unique match on one side and no match on the other side; (2) the paired-end reads where small indels were detected in the Smalt output file; and (3) the paired-end reads that had a unique match in the genome but a low alignment score. We converted the filtered reads into the Pindel input file format. Only the indels that had the support of more than three reads and were detected in at least two accessions were retained as candidate indels. The genomic position of each indel was checked against the GFF file to allow for cotton genome annotation. Genes with indels causing open reading frame changes were considered to have a mutation with a large effect.
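One common way to flag an indel that changes the open reading frame is to check whether it lies in a coding interval and whether its length is not a multiple of three. The sketch below illustrates that rule only. It is an assumption about how "open reading frame changes" could be tested, not a description of the study's procedure, and the function name and coordinates are hypothetical.

```python
from typing import Sequence, Tuple


def causes_frameshift(indel_pos: int, indel_len: int,
                      cds_intervals: Sequence[Tuple[int, int]]) -> bool:
    """True if the indel falls inside a coding interval and shifts the reading frame.

    cds_intervals holds inclusive (start, end) coordinates, e.g. parsed from
    CDS features in a GFF file for one gene.
    """
    in_cds = any(start <= indel_pos <= end for start, end in cds_intervals)
    # A length that is a multiple of 3 adds or removes whole codons, keeping the frame.
    return in_cds and indel_len % 3 != 0


# Hypothetical gene with a single coding interval at 100-200 bp.
cds = [(100, 200)]
shifted = causes_frameshift(150, 2, cds)     # 2-bp indel inside the CDS
in_frame = causes_frameshift(150, 3, cds)    # 3-bp indel inside the CDS
outside = causes_frameshift(250, 1, cds)     # indel outside the CDS
```

This check ignores indels that span an exon boundary or remove a start or stop codon, which would need separate handling.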

SNP validation

We used two methods to validate the SNP calling. First, we used the assemblies of the TM-1 and XH21 genomes to identify the genotypes of TM-1 and XH21 at the SNP site, respectively. We compared the genotypes in the assembled sequences against those in the SNP datasets called from the resequencing data and calculated the SNP accuracy rate. Second, we further randomly selected 68 SNPs and carried out PCR-based sequencing in 11 randomly selected accessions (seven G. hirsutum and four G. barbadense accessions) with three replicates. We aligned all the PCR products against the TM-1 genome using BLAST (version 2.2.28), and the reads with mapping lengths >90% and identity >80% were used for SNP validation. Using the alignment results, we retrieved the genotypes in 11 accessions for each SNP site. Only the genotypes consistent across three replicates were used to calculate the accuracy (Additional file 24: Table S21; Additional file 25: Table S22).
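The replicate-consistency rule and the accuracy calculation can be sketched as below. The data layout (dictionaries keyed by SNP and accession), the function names, and the example genotypes are hypothetical. Only the rule that a PCR genotype counts when all three replicates agree comes from the text.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (SNP identifier, accession identifier)


def consensus(replicates: List[Optional[str]]) -> Optional[str]:
    """Return the genotype if every replicate agrees and none is missing, else None."""
    if not replicates or replicates[0] is None:
        return None
    first = replicates[0]
    return first if all(r == first for r in replicates) else None


def validation_accuracy(called: Dict[Key, Optional[str]],
                        pcr: Dict[Key, List[Optional[str]]]) -> Optional[float]:
    """Fraction of resequencing calls that match the replicate-consistent PCR genotype."""
    compared = matched = 0
    for key, call in called.items():
        truth = consensus(pcr.get(key, []))
        if call is None or truth is None:
            continue  # skip sites without a usable call on either side
        compared += 1
        matched += int(call == truth)
    return matched / compared if compared else None


# Hypothetical genotypes for illustration.
called = {("snp1", "acc1"): "A", ("snp1", "acc2"): "G", ("snp2", "acc1"): "T"}
pcr = {
    ("snp1", "acc1"): ["A", "A", "A"],   # consistent and matches the call
    ("snp1", "acc2"): ["G", "A", "G"],   # inconsistent replicates, excluded
    ("snp2", "acc1"): ["C", "C", "C"],   # consistent but differs from the call
}
accuracy = validation_accuracy(called, pcr)
```

Excluding inconsistent replicates keeps PCR or sequencing errors from being counted against the SNP calls, at the cost of a smaller comparison set.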

Population structure analysis

Using the Ssaha pileup package, we generated an SNP matrix for 147 cotton accessions and calculated the simple matching coefficient of whole-genome SNPs as the genetic distance. We used Phylip software (version 3.69) [60] to generate the neighbor-joining tree. Dendroscope [61] was used to display the phylogenetic tree. The missing data in the cotton SNP genotype dataset were imputed using Beagle (version 3.3.2) [62]. We converted

Title: Genomic Insights Into Divergence And Dual Domestication Of Cultivated Allotetraploid Cottons
Authors: Lei Fang, Hao Gong, Yan Hu, Chunxiao Liu, Baoliang Zhou, Tao Huang, Yangkun Wang, Shuqi Chen, David D. Fang, Xiongming Du, Hong Chen, Jiedan Chen, Sen Wang, Qiong Wang, Qun Wan, Bingliang Liu, Mengqiao Pan, Lijing Chang, Huaitong Wu, Gaofu Mei, Dan Xiang, Xinghe Li, Caiping Cai, Xiefei Zhu, Z. Jeffrey Chen, Bin Han, Xiaoya Chen, Wangzhen Guo, Tianzhen Zhang, Xuehui Huang
Institution: Nanjing Agricultural University
Field: Genomics, Plant Genetics, Domestication
Type: Research
Year of publication: 2017
City: Nanjing
