Nucleotide Fixation during Soybean Domestication and Intensive Breeding

Impacts of Nucleotide Fixation during Soybean Domestication and Intensive Breeding

Plant domestication induced complex morphological and physiological modifications of wild species to meet human needs. Artificial selection during soybean domestication and intensive breeding resulted in substantial phenotypic divergence and gave rise to modern cultivars from their wild ancestors. In our project, population analysis of available sequenced accessions estimated that single nucleotide polymorphisms reach saturation at ~9.8 million in soybean germplasm and at ~5.3 million in cultivars. We observed a severe reduction of genetic diversity in the transition from wild soybeans to landraces and then to elite cultivars. Selective sweeps defined by neutrality tests revealed that 2,255 and 1,051 genes were involved in domestication and subsequent improvement, respectively. This corresponds to 3% of the genome sequence and 4% of all genes being affected by artificial selection.

During the soybean breeding process, strong selective pressure on favored phenotypes could cause nucleotide fixations in soybean cultivars within a short time. Domestication and intensive breeding introduced ~0.1 million nucleotide fixations, which contributed to soybean divergence. Meta-analysis of reported quantitative trait loci and selective signals with nucleotide fixation identified a series of putative candidate genes responsible for 13 agriculturally important traits. Nucleotide fixation mediated by artificial selection affected diverse molecular functions and biological reactions that are associated with morphological and physiological changes in soybean. Among these, plant-pathogen interactions are of particular relevance, because selective nucleotide fixations occurred in disease resistance genes, cyclic nucleotide-gated ion channels and terpene synthases.

Our analysis provides comprehensive insights into the impacts of nucleotide fixation during soybean domestication and intensive breeding, which would facilitate future gene mapping and molecular breeding practice.

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to all those who supported me in completing this thesis. I am deeply grateful to my supervisor, Professor Hon-Ming Lam, for his professional supervision, critical comments and patient guidance throughout this study. He has inspired me to explore the scientific landscape and helped me begin an interesting career.

I sincerely thank the professors and colleagues at BGI-Shenzhen, especially Jun Wang and Huanming Yang, who supported my research and provided invaluable opportunities to take responsibility in a series of large cooperative projects. Great thanks also go to Shengkai Pan, Weiming He, Fengya Zheng, Haiyang Wu and others who helped me greatly in carrying out this graduation project.

I appreciate all the members of our laboratory and the kind professors and staff at the university; with them I have experienced a pleasant and colorful journey during my studies over these years.

Last but not least, I thank my parents and my younger brother and sister for their love and support during my studies, which enabled me to devote myself to overcoming difficulties in pursuit of a glorious career in science.

PUBLICATIONS

Here is a list of the publications I participated in during my PhD period (*Co-first and #Co-corresponding author):

1. H. Yang, C. Li, H.-M. Lam, J. Clements, G. Yan, S. Zhao#. (2015). “Sequencing consolidates molecular markers with plant breeding practice.” Theor Appl Genet. doi:10.1007/s00122-015-2499-8. (An invited review).

2. S. Zhao*#, F. Zheng*, W. He, H. Wu, S. Pan, H.-M. Lam#. (2015). “Impacts of nucleotide fixation during soybean domestication and improvement.” BMC Plant Biology 15, 81-92.

3. X. Z. *, H. L. *, Z. W. *, S. Zhao*, Y. T. *, Z. H. *, Y. W. *, et al. (2015). “The draft genome of Tibetan hulless barley reveals adaptive patterns to the high stressful Tibet Plateau.” Proc Natl Acad Sci 112, 1095–1100.

4. M. Chen*, P. Song*, D. Zou, X. Hu, S. Zhao#, S. Gao#, F. Ling#. (2014). “Comparison of Multiple Displacement Amplification (MDA) and Multiple Annealing and Looping-Based Amplification Cycles (MALBAC) in Single-Cell Sequencing.” PLoS ONE 9, e114520.

5. C. Qin, …, S. Zhao, et al. (2014). “Whole-genome sequencing of cultivated and wild peppers provides insights into Capsicum domestication and specialization.” Proc Natl Acad Sci 111, 5135–5140.

6. N. He*, C. Zhang*, X. Qi*, S. Zhao*, Y. Tao*, et al. (2013). "Draft genome sequence of the mulberry tree Morus notabilis." Nat Commun 4: 2445.

7. Z. Gao*, S. Zhao*, W. He*, L. Guo*, Y. Peng*, J. Wang*, et al. (2013). “Dissecting yield-associated loci in super-hybrid rice by resequencing recombinant inbred lines and improving parental genome sequences.” Proc Natl Acad Sci 110(35): 14492-14497.

8. H.-Q. Ling#, S. Zhao*, D. Liu*, J. Wang*, H. Sun*, C. Zhang*, et al. (2013). "Draft genome of the wheat A-genome progenitor Triticum urartu." Nature 496(7443): 87-90.

9. J. Jia#, S. Zhao*, X. Kong*, Y. Li*, G. Zhao*, W. He*, R. Appels*, et al. (2013). "Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation." Nature 496(7443): 91-95.

10. S. Zhao*, P. Zheng*, S. Dong*, X. Zhan*, Q. Wu*, et al. (2013). "Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation." Nat Genet 45(1): 67-71.

11. Y. Li*, S. Zhao*, J. Ma*, D. Li*, et al. (2013). "Molecular footprints of domestication and improvement in soybean revealed by whole genome re-sequencing." BMC Genomics 14(1): 579.

12. W. He, S. Zhao, X. Liu, S. Dong, J. Lv, D. Liu, J. Wang, Z. Meng. (2013). “ReSeqTools: an integrated toolkit for large-scale next-generation sequencing based resequencing analysis.” Genet. Mol. Res. 12, 6275–6283.

13. R. K. Varshney, C. Song, …, S. Zhao, et al. (2013). "Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait improvement." Nat Biotechnol 31(3): 240-246.

14. R. Li, W. Fan, …, S. Zhao, et al. (2010). "The sequence and de novo assembly of the giant panda genome." Nature 463(7279): 311-317.

LIST OF TABLES

Table 2.1 Summary of sequencing soybean accessions

Table 3.1 The genome-wide pairwise nucleotide diversity θw and θπ (× 10⁻³)

Table 3.2 Meta-analysis of the published QTLs responsible for agriculturally important traits and selective genes with nucleotide fixation during early domestication

Table 3.3 Meta-analysis of the published QTLs responsible for agriculturally important traits and selective genes with nucleotide fixation during modern improvement

Table 3.4 Top KEGG categories of the selective genes with nucleotide fixation during early domestication

Table 3.5 Top KEGG categories of the selective genes with nucleotide fixation during modern improvement

Table 3.6 Functional analysis of the selective genes with nucleotide fixation based on agriGO

LIST OF FIGURES

Figure 1.1 Sequencing strategies of representative NGS technologies

Figure 1.2 Genomic features exhibited by K-mer analysis and standards for a complex plant genome

Figure 1.3 The trends of crop sequencing toward breeding practice

Figure 1.4 Strong artificial selection is expected to cause selective sweeps

Figure 1.5 Distribution of genetic variations across the soybean genome

Figure 2.1 The phenotypes of seeds in typical wild, landrace and elite soybeans

Figure 2.2 A schematic pipeline for SNP detection from sequencing to population analysis

Figure 3.1 Detection of SNPs in sequencing soybean accessions

Figure 3.2 Analysis of genetic diversity and phylogenetic relationship among soybean accessions

Figure 3.3 Phylogenetic tree and population structure of accessions from three distinct gene pools

Figure 3.4 LD decay determined by squared correlation coefficient of allele frequencies (r²) against distance among three soybean populations

Figure 3.5 The diversity pattern of artificial selection regions during domestication and genetic improvement

Figure 3.6 Footprints of artificial selection during early domestication and modern improvement

Figure 3.7 Detection of candidate genome regions and genes that underwent artificial selection during domestication and genetic improvement

Figure 3.8 Pathway analyses for domestication and improvement genes by KEGG

Figure 3.9 The gene diversity of genomic regions of seed size and seed coat blooming on chromosome 13

Figure 3.10 Distribution of nucleotide fixations on each soybean chromosome

Figure 3.11 The distribution of nucleotide fixation over the genome versus in the selective regions

Figure 3.12 (A) PCA and (B) phylogenetic tree among soybean accessions based on nucleotide fixation

Figure 3.13 Functional annotations of selective genes with nucleotide fixation introduced in early domestication and modern improvement

Figure 3.14 Over-represented GO category of cellular component in the selective genes with nucleotide fixation

Figure 3.15 Over-represented GO category of molecular function in the selective genes with nucleotide fixation

Figure 3.16 Over-represented GO category of biological process in the selective genes with nucleotide fixation

Figure 3.17 The allele frequency of SNPs in wild soybeans that were fixed in cultivars

Figure 3.18 The accumulated KEGG pathway in the genes with nucleotide fixation in wild soybeans

Figure 3.19 Selective genes with nucleotide fixation involved in plant hormone signal transduction pathway

Figure 3.20 The protein topology of CNG channels involved in plant-pathogen interaction pathway

TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

PUBLICATIONS

LIST OF TABLES

LIST OF FIGURES

ABBREVIATIONS

Chapter 1 Introduction

1.1 The advances of sequencing technologies

1.2 The trends of crops sequencing

1.3 Genome sequencing efforts in legumes

1.4 The nature of soybean domestication

1.5 Objectives of this research

Chapter 2 Materials and Methods

2.1 Soybeans sequenced and data collection

2.2 Data processing and SNP detection

2.3 Statistics of genetic diversity

2.4 Population structure and phylogeny

2.5 Detection of artificial selection signals

2.6 Identification of nucleotide fixations

2.7 Functional enrichment of selective genes

2.8 Inferring protein topology

Chapter 3 Analyses and Results

3.1 Estimation of SNPs among soybean populations

3.2 Reduction of genetic diversity in soybean breeding

3.3 Decrease of haplotype diversity during soybean breeding

3.4 Signals of soybean domestication and intensive breeding

3.5 Identification of genes with QTL mapping and selective sweeps

3.6 Detection of nucleotide fixation affected by artificial selection

3.7 Agronomic traits affected by selective nucleotide fixation

3.8 Nucleotide fixations in wild soybeans

3.9 Plant-pathogen interaction affected by selective nucleotide fixation

Chapter 4 Discussion and Conclusion

4.1 Insights into soybean domestication

4.2 Nucleotide fixations were crucial in soybean breeding

4.3 Artificial selection accelerated nucleotide fixation

4.4 Evolutionary perspective of nucleotide fixation

4.5 Conclusion

Reference

ABBREVIATIONS

BAC    Bacterial artificial chromosome

GIF1    Grain Incomplete Filling 1

CNG    Cyclic nucleotide-gated ion channel

GWAS    Genome-wide association studies

KEGG    Kyoto Encyclopedia of Genes and Genomes

LD    Linkage disequilibrium

LTR    Long terminal repeats

PBS    Population branch statistic

PCA    Principal component analysis

QTL    Quantitative trait loci

SNP    Single nucleotide polymorphism

SOLiD    Sequencing by oligo ligation detection

TE    Transposable element

WGD    Whole genome duplication

WGS    Whole genome shotgun

Chapter 1 Introduction

1.1 The advances of sequencing technologies

Next-generation sequencing (NGS) technologies have made tremendous strides in plant and animal studies (Shendure and Ji 2008). Because this project used a large amount of sequencing data to analyze soybean domestication and intensive breeding, we give an overview of NGS technologies here. NGS is typically represented by HiSeq/MiSeq from Illumina, SOLiD/Ion Torrent PGM from Life Technologies, and GS FLX Titanium/GS Junior from Roche. Over the past years, NGS technologies have evolved rapidly in many respects, including robust protocols for constructing sequencing libraries, effective new approaches for data mining, and a paradigm shift in experimental design (Metzker 2009; Koboldt et al. 2013).

At the end of 2005, the 454 sequencing system was introduced. It is built on emulsion PCR and pyrosequencing, which relies on detecting the pyrophosphate released during DNA extension (Figure 1.1a). The library DNA is denatured into single strands, and each strand is attached to a bead and clonally amplified by emulsion PCR. Each bead carrying an amplified DNA fragment is then deposited into a separate well of a micro-fabricated plate. The four nucleotides are added to the sequencing reaction one at a time. When the correct nucleotide is incorporated, pyrophosphate is released and converted to ATP, which luciferase uses to oxidize luciferin to oxyluciferin. This process generates a burst of light, which is captured by a charge-coupled device camera and assigned to the coordinates of the specific well.
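The flow-by-flow logic of pyrosequencing can be illustrated with a toy simulation. The sketch below assumes an idealized, noise-free reaction in which nucleotides are flowed one at a time in a fixed cyclic order and the light signal of each flow is proportional to the number of bases incorporated. The flow order `TACG` and the function names `simulate_flowgram` and `call_bases` are illustrative assumptions, not taken from the instrument or from this project.

```python
def simulate_flowgram(read, flow_order="TACG", max_flows=400):
    """Return (nucleotide, signal) pairs for an idealized pyrosequencing run.

    `read` is the strand being synthesized (5'->3'). Each flow offers one
    nucleotide; the signal equals the length of the homopolymer run that
    is incorporated in that flow (0 if the nucleotide does not match).
    """
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(read) and flow_index < max_flows:
        nuc = flow_order[flow_index % len(flow_order)]
        run = 0
        # Extend through the whole homopolymer run in a single flow.
        while pos < len(read) and read[pos] == nuc:
            run += 1
            pos += 1
        flows.append((nuc, run))
        flow_index += 1
    return flows


def call_bases(flows):
    """Reconstruct the read from noise-free flow signals."""
    return "".join(nuc * signal for nuc, signal in flows)


if __name__ == "__main__":
    flowgram = simulate_flowgram("TTACGG")
    print(flowgram)
    print(call_bases(flowgram))
```

Because a whole homopolymer run is incorporated in one flow, its length has to be inferred from signal intensity alone, which is why real pyrosequencing data are most error-prone in long homopolymers.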

The Genome Analyzer was first released by Solexa in 2006, and the company was acquired by Illumina the following year. The core technologies of the Illumina sequencing system are reversible terminators, DNA clusters and the concept of sequencing by synthesis. Sequencing adaptors are added to both ends of the DNA fragments, which are denatured into single strands and grafted onto the flow cell, followed by about 30 cycles of bridge amplification (Figure 1.1b). In this way, clonal DNA clusters are generated according to the manufacturer's instructions. The amplicons are single stranded after linearization, and a sequencing primer is hybridized to a universal sequence. Each sequencing cycle consists of a single-base extension with a specific DNA polymerase and a mixture of four reversibly terminated nucleotides (A, G, C and T), each labeled with a different fluorescent dye. Because the reversible terminator blocks the 3' hydroxyl position, only one nucleotide can be added in each sequencing cycle. After the signal of the incorporated nucleotide is acquired, the terminator and dye are cleaved for the next cycle. As the template is progressively converted to double-stranded DNA, the captured signals reveal each base of the template. In our project, the sequencing data were mainly generated by improved Illumina technologies.

SOLiD stands for Sequencing by Oligo Ligation Detection, a system that adopts two-base encoding based on probe hybridization and ligation (Figure 1.1c). Library construction for SOLiD is similar to that for Roche/454. DNA samples are first broken into fragments and ligated to adapters, then attached to beads and amplified by emulsion PCR. One difference is that SOLiD employs ligase instead of polymerase. Libraries are sequenced using 8-base single-stranded fluorescent probes. The 5' end of each probe carries one of four fluorescent labels, which is determined by the two consecutive bases at its 3' end. After the probe is ligated, the 8-base oligonucleotide is cleaved after the fifth base from the 3' end, which exposes a 5' phosphate for the next ligation. At the same time, the fluorescence is recorded and the two bases at the ligation position can be distinguished. In the first cycle, bases 1 and 2 of the template are determined; in the second cycle, bases 6 and 7 are determined; and so on. Moreover, ligation sequencing can also be performed in the opposite direction. In the SOLiD platform each base is interrogated twice, resulting in high base-calling accuracy.

Figure 1.1 Sequencing strategies of representative NGS technologies for (a) 454 platform, (b) Illumina platform and (c) SOLiD platform. This figure was modified from Shendure and Ji (Shendure and Ji 2008).
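The two-base encoding used by SOLiD can be sketched in a few lines. The example below assumes the conventional four-color scheme, in which each color depends on a pair of adjacent bases and can be written as the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The function names are hypothetical, and this is an illustration of the encoding rather than of any vendor software.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colorspace(seq):
    """Encode a non-empty base sequence as first base + one color per base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(encoded):
    """Decode a color-space string back to bases, starting from the known first base."""
    bases = [encoded[0]]
    for color in encoded[1:]:
        previous = BASE_TO_BITS[bases[-1]]
        # XOR is its own inverse, so the next base is previous XOR color.
        bases.append(BITS_TO_BASE[previous ^ int(color)])
    return "".join(bases)


if __name__ == "__main__":
    encoded = encode_colorspace("ATGGC")
    print(encoded)
    print(decode_colorspace(encoded))
```

Since every base contributes to two adjacent colors, a true single-base variant changes two consecutive colors, whereas an isolated color change is more likely a measurement error. This is the sense in which each base is checked twice.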

1.2 The trends of crops sequencing

Genome sequencing is now a widely accepted approach to understanding the molecular basis of phenotypic variation, accelerating gene cloning and marker-assisted selection, and improving the exploitation of genetic diversity for efficient crop improvement. Recent advances in high-throughput next-generation sequencing (NGS) platforms have brought a deluge of crop genomes, thanks to longer reads, higher single-base accuracy, reduced costs and matching analytical approaches. The wild mustard Arabidopsis thaliana (Initiative 2000) was the first sequenced plant, and rice, Oryza sativa (Goff et al. 2002; Yu et al. 2002; Sequencing Project 2005), remains the first sequenced crop species, obtained with Sanger sequencing of bacterial artificial chromosome (BAC) clones and whole genome shotgun (WGS) approaches. NGS platforms such as the Roche 454 pyrosequencer, Applied Biosystems SOLiD and the Illumina Solexa HiSeq series have been applied to draft genome sequencing of multiple plants. In addition, hybrid sequencing approaches that combine traditional Sanger methods with NGS technologies have been widely used to sequence both small and complex plant genomes. The crop plants sequenced to date can be broadly classified into three waves, which are briefly described below and in more detail in our recent review (Yang et al. 2015).

In the first wave, model plants were sequenced using a BAC-by-BAC approach to construct physical maps, such as Arabidopsis (Initiative 2000), rice (Goff et al. 2002; Yu et al. 2002; Sequencing Project 2005), Brachypodium (Initiative 2010) and Medicago (Young et al. 2011). The genomes of these plants are usually small, which allowed high-quality or close-to-complete genome sequences to be achieved by Sanger sequencing.

In the second wave, several staple or major economic crops were sequenced, such as sorghum (Paterson et al. 2009), soybean (Schmutz et al. 2010) and maize (Schnable et al. 2009). Compared with the models, these genomes are relatively large, with a high level of repetitive sequences (>60%). A combination of the Sanger method and the WGS strategy was applied to obtain high-quality, though not complete, genome assemblies. Of these genomes, maize still employed the BAC-by-BAC strategy because ~85% of its genome consists of transposable elements (Schnable et al. 2009). During this wave, NGS platforms emerged and matured, substantially decreasing sequencing costs, which promoted the third and current wave.

In the third wave, tens of thousands of genomes are in the process of being sequenced, including important crops with large genomes such as barley (International Barley Genome Sequencing Consortium et al. 2012; Zeng et al. 2015) and wheat (Brenchley et al. 2013), orphan crops such as pigeonpea (Varshney et al. 2012) and chickpea (Varshney et al. 2013), and horticultural species including fruits such as apple (Velasco et al. 2010) and peach (International Peach Genome Initiative et al. 2013), vegetables such as potato (Xu et al. 2011) and cabbage (Wang et al. 2011), and flowers such as plum flower (Chen et al. 2012), carnation (Yagi et al. 2014) and orchid (Cai et al. 2015). With the WGS strategy, these genomes were sequenced and assembled on single or multiple NGS platforms. However, most genome assemblies are draft sequences and need extra effort to improve their quality.

Despite the achievements of recent years, several factors remain challenging in crop sequencing, such as high heterozygosity, large and diverse repetitive elements, and frequent polyploidy, all of which contribute to the dynamic complexity of crop genomes. In K-mer analysis, heterozygosity causes a lower peak before the highest one, and polyploidy causes another peak after it (Figure 1.2). Sequencing errors contribute to the high head of the distribution at low depth, whereas repetitive sequences and a large genome size cause a long tail.

Figure 1.2 Genomic features exhibited by K-mer analysis and standards for a complex plant genome. The difficulties that affect plant genome assembly can be reflected using K-mer analysis: the lower peaks before and after the highest indicate heterozygosity and polyploidy, respectively. The long tail probably represents a large amount of repetitive sequences and a large genome. From Yang et al. (Yang et al. 2015).
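The K-mer depth distribution discussed here can be computed with a short script. The following is a minimal sketch that counts K-mers in a set of reads, builds the depth spectrum, and applies the common rule of thumb that genome size is roughly the total number of K-mers divided by the depth of the main peak. The function names, the `min_depth` cutoff for discarding likely error K-mers, and the choice not to merge reverse-complement K-mers are simplifying assumptions made for illustration; production tools handle these details differently.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every K-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map depth -> number of distinct K-mers observed at that depth."""
    return Counter(counts.values())


def estimate_genome_size(counts, main_peak_depth, min_depth=2):
    """Rough genome size: total K-mer occurrences / depth of the main peak.

    K-mers seen fewer than `min_depth` times are treated as likely
    sequencing errors and ignored (an assumed, adjustable cutoff).
    """
    total = sum(c for c in counts.values() if c >= min_depth)
    return total / main_peak_depth


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG"]
    counts = count_kmers(reads, k=3)
    print(dict(kmer_spectrum(counts)))
    print(estimate_genome_size(counts, main_peak_depth=2))
```

In a real spectrum, the main peak depth is read off the histogram; a secondary peak at about half that depth suggests heterozygosity, and peaks at multiples of it suggest duplicated or polyploid content.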

Typically, there are three stages in crop sequencing: the genome scale, the population scale and the panel scale (Figure 1.3). Most crops are still at the genome scale, which primarily focuses on genome quality. Usually, a single genome is not sufficient for a given crop species, and sequencing the wild ancestor of the crop becomes necessary, as for undomesticated rice (Chen et al. 2013) and wild soybean (Kim et al. 2010; Qi et al. 2014). At the population scale, mining genetic variation is important for studying population structure or detecting marker-trait associations. A series of models have been developed for variant calling (Nielsen et al. 2011; Nielsen et al. 2012), on the basis of which genetic diversity, haplotypes and linkage disequilibrium (LD) can be evaluated and inferred. These measures are crucial for understanding demographic processes, geographic origin, natural selection and domestication. Genome-wide genetic variation makes it possible to conduct association studies in geographical or natural populations and linkage analysis in breeding populations. Moreover, various kinds of RNAs and their corresponding roles are now being identified and quantified by transcriptome sequencing (Mortimer et al. 2014).

Figure 1.3 The trends of crop sequencing toward breeding practice. X-axis represents sequencing samples and Y-axis represents the accumulation of molecular and breeding knowledge. The inset shows the accumulation of genetic variations based on sequencing and molecular markers on microarrays for several crops. Crop sequencing can be divided into the genome scale, population scale and panel scale. At each stage, different knowledge needs to be accumulated for a given crop, as described. From Yang et al. (Yang et al. 2015).

The final stage, the panel scale, pays more attention to precise phenotypic assessment. Marker-trait associations are re-evaluated in different environments. The contribution of environmental factors to a specific phenotype is of particular importance, especially for abiotic stress tolerance (Xu et al. 2012b). A panel of populations should be developed with large-scale phenotyping and multi-environment typing. The interactions among genes, traits and environments can then be better studied using multiple sequencing approaches. Novel germplasm with higher production and better quality will eventually be created through computational modeling and modern breeding technologies. This scenario applies well to soybean sequencing research and breeding activities.

1.3 Genome sequencing efforts in legumes

The genus Glycine consists of two subgenera (Doyle et al. 2003): the subgenus Soja includes two annual self-pollinated species, the cultivated soybean G. max and its wild progenitor G. soja, while the subgenus Glycine comprises more than a dozen wild perennial species. Cultivated soybean is a globally important crop, providing oil and protein for humans and animals. Selection and cultivation began more than 3,000 years ago (Hymowitz 1970). To date, seven legume genomes have been sequenced and decoded: Lotus, soybean and wild soybean, Medicago, pigeonpea, chickpea, and common bean. Here, I present a brief review of the results from these genomes from structural, functional, comparative and evolutionary perspectives.

The soybean genome, ~1.1 Gb in size, was sequenced using the WGS strategy and Sanger methods and was integrated with physical and high-density genetic maps to create a chromosome-scale draft assembly (Schmutz et al. 2010). It is the largest among the sequenced legumes and contains a total of 46,430 protein-coding genes with an abundance of transposable elements. LTR retrotransposons make up 42% of the soybean genome, with Gypsy-like elements more abundant than Copia-like elements. A relatively small number of molecular markers and 5,671 transcription factors were also identified in the soybean genome. A total of 109 drought-responsive genes were identified, suggesting a role in drought tolerance in the soybean plant. The genes involved in lipid signaling, degradation of storage lipids and membrane lipid synthesis have expanded, which is linked to the production of edible oil in soybean seeds.

Within the Leguminosae family, soybean belongs to the millettioid subgroup of the papilionoid subfamily. Following lineage divergence, the soybean genome underwent a recent WGD around 13 Myr ago, resulting in the expansion of its genome size and the accumulation of TEs (Shoemaker et al. 2006). This recent WGD resulted in local gene duplication and gene rearrangement in the soybean genome, which contributed to the expansion of oil biosynthesis genes and nodulin-related genes. Homology analysis revealed that 61.4% of the homologous genes were in blocks involving only two chromosomes, only 5.63% spanned three chromosomes, and 21.53% traversed four chromosomes of soybean. Polyploidy in soybean led to differential rates of gene loss, following an exponential decay pattern in which gene loss was rapid after duplication and slowed over time. An accurate soybean genome sequence will facilitate the identification of the genetic basis of various soybean traits and accelerate the creation of improved soybean varieties.

Within the Leguminosae family, Medicago belongs to the galegoid subgroup of the papilionoid subfamily. The ~375 Mb Medicago genome was sequenced by a hybrid of BAC-by-BAC and WGS approaches (Young et al. 2011). The Medicago genome contains 62,388 protein-coding genes, the highest number among the sequenced legumes. A total of 3,692 transcription factors were found in the Medicago genome, reflecting their role in nodule formation and symbiotic nitrogen fixation. It is hypothesized that two important nodulation-specific signaling components in Medicago evolved from more ancient genes that originally functioned in mycorrhizal signaling and were then duplicated by the WGD event around 58 million years ago.

The draft genome sequences of six leguminous crops reveal that the Medicago genome contains the highest number of protein-coding genes with a very high gene density (166.36/Mb). Having experienced a recent WGD ~13 million years ago, the soybean genome is the largest, with a greater abundance of TEs and a relatively low gene density (48.87/Mb). The Medicago genome is rich in disease-resistance-related and nodulin-related genes, whose expansion occurred after a WGD during the divergence of Medicago within the papilionoid subfamily around 58 million years ago. The functional domains and gene families of several transcription factors, transporters and receptor protein kinases have expanded in the Lotus genome, reflecting their roles in symbiotic nitrogen fixation. The Lotus genome also contains a large number of functional microRNAs. Owing to segmental and whole-genome duplication, a large number of drought tolerance-related genes evolved in pigeonpea and soybean, providing insight into the genetic architecture of drought tolerance in these genomes.

Around 54 million years ago, the papilionoid subfamily diverged into two major subgroups, the millettioids (soybean and pigeonpea) and the galegoids (Medicago, Lotus and chickpea). Within the millettioid clade, pigeonpea diverged from soybean around 10-20 Myr ago. After the divergence, the soybean genome underwent a recent WGD around 13 Myr ago, resulting in the expansion of its genome size and the accumulation of TEs. Within the galegoid clade, chickpea diverged from Lotus and Medicago around 20-30 Myr and 10-20 Myr ago, respectively. A WGD and local gene rearrangement occurred in the Medicago and Lotus genomes, resulting in sub- or neo-functionalization of signaling components and regulators with specialized roles in nodulation. Comparative analyses between the millettioid and galegoid subgroups reveal that 16,380 gene families are shared by the two subgroups, whereas 2,951 and 5,331 gene families are specific to the millettioid and galegoid subgroups, respectively. These sequences provide important resources for elucidating the evolution of the legume family. Once reference genomes are available, a large number of different accessions can be used to detect genetic variation, natural and artificial selection, and population dynamics.
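The gene densities quoted above are simply gene counts divided by genome length, and the arithmetic can be checked against the figures given in this section. The sketch below is an illustrative calculation only: it uses the gene counts and genome sizes stated in the text, and the ~950 Mb value for soybean is back-calculated from the reported density rather than stated in the source.

```python
def gene_density(n_genes, size_mb):
    """Genes per megabase of sequence."""
    return n_genes / size_mb


# Medicago: 62,388 genes over ~375 Mb, as stated in the text.
medicago_density = gene_density(62_388, 375)

# Soybean: 46,430 genes at 48.87 genes/Mb implies the sequence length used
# as the denominator (an inference, not a value given in the text).
soybean_implied_size_mb = 46_430 / 48.87

# For comparison, the density if the full ~1.1 Gb estimate were used.
soybean_density_full_size = gene_density(46_430, 1100)

if __name__ == "__main__":
    print(round(medicago_density, 2))
    print(round(soybean_implied_size_mb))
    print(round(soybean_density_full_size, 2))
```

Under these numbers the reported soybean density is consistent with an assembled length of roughly 950 Mb rather than the ~1.1 Gb estimated genome size, so the two legume densities are best compared with the denominator in mind.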

1.4 The nature of soybean domestication

Modern soybeans were domesticated from their wild progenitor in East Asia and were widespread in China by the Zhou Dynasty (~2500 B.P.) according to historical records (Lee et al. 2011). During soybean domestication, complex morphological changes occurred that distinguish cultivars from their wild ancestors. Artificial selection played an important role in pursuing desirable traits to meet human needs; it reduced the genetic diversity of soybean and shaped selective sweeps along the genome (Nurminsky et al. 1998). A selective sweep reveals the inheritance of regions around adaptive alleles (Figure 1.4). With the availability of the soybean genome (Schmutz et al. 2010), population analyses of wild and cultivated soybeans have unearthed genome-wide signals of artificial selection (Lam et al. 2010; Li et al. 2013). Resequencing of 31 wild and cultivated soybean genomes identified patterns of genetic diversity and selection. A number of genetic variations revealed the molecular footprints of domestication and improvement in 25 diverse soybean accessions. The reduction of genetic diversity in cultivated soybean has created a barrier to the improvement of soybean cultivars.

Figure 1.4 Strong artificial selection is expected to cause selective sweeps. This characteristic pattern will reveal genomic regions that have undergone selection during domestication. Modified from Andersson and Georges (Andersson and Georges 2004).
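The loss of diversity that marks a selective sweep is usually quantified per genomic window. The sketch below computes two standard estimators, pairwise nucleotide diversity (θπ) and Watterson's estimator (θw), from alternate-allele counts at SNP sites, and compares wild with cultivated samples in the same window. The function names, the toy inputs and the use of a simple wild-to-cultivar ratio are illustrative assumptions; they are not the selection tests used later in this thesis.

```python
def theta_pi(alt_counts, n, length):
    """Average pairwise differences per site.

    alt_counts: alternate-allele count at each SNP among n sampled chromosomes.
    length: number of sites in the window.
    """
    pairs = n * (n - 1) / 2
    differences = sum(j * (n - j) for j in alt_counts) / pairs
    return differences / length


def theta_w(alt_counts, n, length):
    """Watterson's estimator: segregating sites scaled by the harmonic number."""
    segregating = sum(1 for j in alt_counts if 0 < j < n)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n / length


def diversity_ratio(wild_counts, n_wild, cult_counts, n_cult, length):
    """pi(wild) / pi(cultivar); large values suggest diversity lost in cultivars."""
    pi_cult = theta_pi(cult_counts, n_cult, length)
    pi_wild = theta_pi(wild_counts, n_wild, length)
    return float("inf") if pi_cult == 0 else pi_wild / pi_cult


if __name__ == "__main__":
    # Toy window of 10 sites with two SNPs, 4 chromosomes per population.
    print(theta_pi([1, 2], n=4, length=10))
    print(theta_w([1, 2], n=4, length=10))
    print(diversity_ratio([1, 2], 4, [1, 0], 4, length=10))
```

Windows in which the ratio is unusually high compared with the genome-wide distribution are candidates for sweeps, although demographic bottlenecks can produce similar patterns and need to be accounted for.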

The available soybean genome and NGS technologies also provide an unprecedented opportunity to investigate domestication events and phenotypic diversification (Schmutz et al. 2010). Researchers can now exploit the genetic diversity in wild soybeans and landraces for the sustainable enhancement of soybeans. The genomes of seven representative accessions were also assembled to obtain pan-genome sequences for wild soybean (Li et al. 2014). The pan-genome of wild soybean showed the extent of novel genes and alleles in wild relatives that can be employed in breeding through introgression (Figure 1.5). Recently, soybean sequencing was extended to 302 accessions collected worldwide to infer the phylogeny, domestication and functional loci for agronomic traits (Zhou et al. 2015). GWAS revealed several selected regions responsible for nine domestication or improvement traits and identified 13 previously uncharacterized loci for agronomic traits, including oil content, plant height and pubescence form.

Figure 1.5 Distribution of genetic variations across the soybean genome. The numbers give the percentage of genes carrying the corresponding mutations in 1-Mbp windows. Modified from Li et al. (2014).

As described above, gene mapping and genomic analyses can be used to detect heritable changes during plant domestication (Vaughan et al. 2007). Resequenced wild and cultivated soybean accessions disclosed the nature and extent of genetic diversity in both populations (Kim et al. 2010; Lam et al. 2010; Chung et al. 2014). In addition, numerous agronomic traits have been proposed to be controlled by a small number of genes or major QTLs (Doebley et al. 2006; Qi et al. 2014). However, more work is needed to narrow down these QTL regions for further gene mapping.

In evolution, a mutation that happens to be beneficial can spread rapidly through the population under selection (Nielsen 2006). During crop domestication, strong selective pressure fixes the traits of interest in quite a short time (Innan and Kim 2004). As a result, advantageous mutations responsible for the traits of interest are subject to fixation in the breeding population. These fixation events differ from those that occur in natural populations, because artificial selection is frequently exerted on alleles that were neutral before domestication. Understanding this kind of nucleotide fixation is therefore crucial to reconstructing soybean evolution and domestication.
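The idea of a nucleotide fixation can be made concrete with a minimal sketch: a site counts as fixed here when it still segregates in the wild sample but carries a single allele in the cultivated sample. This working definition, the use of "N" for missing calls, and the function names are assumptions for illustration; the criteria actually applied in this project may differ.

```python
def called_alleles(genotypes):
    """Set of alleles observed at one site, skipping missing calls ('N')."""
    return {g for g in genotypes if g != "N"}


def find_fixations(sites):
    """Return (position, fixed_allele) for candidate fixation sites.

    sites: iterable of (position, wild_genotypes, cultivated_genotypes).
    A site is reported when the cultivated sample shows exactly one allele
    while the wild sample shows more than one.
    """
    fixed = []
    for position, wild, cultivated in sites:
        wild_set = called_alleles(wild)
        cultivated_set = called_alleles(cultivated)
        if len(cultivated_set) == 1 and len(wild_set) > 1:
            fixed.append((position, next(iter(cultivated_set))))
    return fixed


# Toy data (hypothetical positions and genotypes).
toy_sites = [
    (101, ["A", "G", "A", "G"], ["G", "G", "G", "N"]),  # fixed for G in cultivars
    (250, ["C", "C", "C", "C"], ["C", "C", "C", "C"]),  # monomorphic everywhere
    (377, ["T", "A", "T", "T"], ["T", "A", "T", "T"]),  # still segregating
]

fixations = find_fixations(toy_sites)
print(fixations)
```

With real data, small sample sizes can make a rare allele look absent, so a filter on the number of called genotypes per population would normally be added before a site is accepted.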

1.5 Objectives of this research

The modern cultivated soybean Glycine max was domesticated from its wild progenitor Glycine soja several thousand years ago. Although wild and cultivated soybeans exhibit substantial morphological variation, their genomic sequences differ only slightly (Kim et al. 2010; Qi et al. 2014). Following early domestication, the emergence of modern breeding technologies promoted the genetic improvement of soybean landraces, which produced the elite cultivated accessions. Based on the above, the objectives of this research are to address the following issues:

1) To estimate the genetic variations and genetic diversity in wild, landrace and elite soybean populations;

2) To elucidate the consequences and footprints of artificial selection accompanying soybean domestication and modern intensive breeding;

3) To detect nucleotide fixations caused by artificial selection in soybean breeding;

4) To understand the impacts of nucleotide fixation in the differentiation of wild and cultivated soybeans.

Chapter 2 Materials and Methods

2.1 Sequenced soybeans and data collection

The soybean accessions analyzed here, comprising 24 popular elite cultivars, 15 early landraces and 31 wild soybeans, were described in several studies (Kim et al. 2010; Lam et al. 2010; Li et al. 2013; Chung et al. 2014). These samples came mainly from the mini-core soybean germplasm collected across a large ecological area of China and South Korea (Wang et al. 2006; Guo et al. 2014). They represent all major operational taxonomic units of the Chinese soybean germplasm (Li et al. 2010) and originate from the major soybean breeding areas, ranging over 24.1-46.4 °N and 102.4-126.6 °E. They exhibit substantial morphological differences; for example, wild soybeans produce small black seeds, whereas elite cultivars have high seed yield. Seed size is one of the most interesting domestication phenotypes (Figure 2.1).

Figure 2.1 Seed phenotypes of typical wild, landrace and elite soybeans. Modified from Li et al. (2013).

Total genomic DNA of each sample was extracted from fresh leaves of dark-grown plants. DNA libraries were prepared and sequenced on Illumina platforms. Short reads were generated from image files using Illumina's base-calling pipeline with default parameters. All sequencing data were downloaded from the NCBI Sequence Read Archive (SRA) under accession numbers SRP015830, SRA020131, SRA009252 and ERP002622. Sequencing statistics and sample information are listed in detail in Table 2.1.

Table 2.1. Summary of sequenced soybean accessions collected from available publications.

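| Name in papers | New name | Category | Geographic origin | Data (Gb) | Reference |
|---|---|---|---|---|---|
| cul01 | A01 | Elite | Korea | 28.6 | (Chung et al. 2014) |
| cul02 | A02 | Elite | Korea | 28.0 | (Chung et al. 2014) |
| cul03 | A03 | Elite | Korea | 21.6 | (Chung et al. 2014) |
| cul04 | A04 | Elite | Korea | 17.0 | (Chung et al. 2014) |
| cul05 | A05 | Landrace | Korea | 18.7 | (Chung et al. 2014) |
| cul06 | A06 | Landrace | Korea | 18.0 | (Chung et al. 2014) |
| cul07 | A07 | Landrace | Korea | 20.9 | (Chung et al. 2014) |
| cul08 | A08 | Landrace | Korea | 16.2 | (Chung et al. 2014) |
| cul09 | A09 | Elite | Korea | 19.8 | (Chung et al. 2014) |
| cul10 | A10 | Elite | Korea | 18.8 | (Chung et al. 2014) |
| E1 | E1 | Elite | China, Fujian | 2.75 | (Li et al. 2013) |
| E2 | E2 | Elite | China, Heilongjiang | 5.06 | (Li et al. 2013) |
| E3 | E3 | Elite | China, Heilongjiang | 4.96 | (Li et al. 2013) |
| E4 | E4 | Elite | China, Peking | 2.98 | (Li et al. 2013) |
| E5 | E5 | Elite | China, Henan | 5.04 | (Li et al. 2013) |
| E6 | E6 | Elite | China, Peking | 5.24 | (Li et al. 2013) |
| E7 | E7 | Elite | China, Henan | 2.93 | (Li et al. 2013) |
| E8 | E8 | Elite | China, Hebei | 2.48 | (Li et al. 2013) |
| E9 | E9 | Elite | China, Hebei | 3.33 | (Li et al. 2013) |
| C01 | C01 | Elite | China, Shandong | 5.65 | (Lam et al. 2010) |
| C02 | C02 | Elite | China, Liaoning | 5.96 | (Lam et al. 2010) |
| C08 | C08 | Elite | USA | 9.39 | (Lam et al. 2010) |
| C12 | C12 | Elite | China, Shanxi | 5.91 | (Lam et al. 2010) |
| C14 | C14 | Elite | Brazil | 6.32 | (Lam et al. 2010) |
| C16 | C16 | Mutation | China, Taiwan | 6.43 | (Lam et al. 2010) |
| C17 | C17 | Landrace | China, Sichuan | 5.83 | (Lam et al. 2010) |
| C19 | C19 | Elite | China, Jilin | 6.29 | (Lam et al. 2010) |
| C24 | C24 | Elite | China, Jiangxi | 6.34 | (Lam et al. 2010) |
| C27 | C27 | Elite | China, Hebei | 6.76 | (Lam et al. 2010) |
| C30 | C30 | Elite | China, Henan | 5.88 | (Lam et al. 2010) |
| C33 | C33 | Elite | China, Heilongjiang | 5.95 | (Lam et al. 2010) |
| C34 | C34 | Landrace | China, Guangxi | 6.31 | (Lam et al. 2010) |
| C35 | C35 | Landrace | China, Guangdong | 5.81 | (Lam et al. 2010) |
| W01 | W01 | Wild | China, Peking | 6.80 | (Lam et al. 2010) |
| W02 | W02 | Wild | China, Liaoning | 6.42 | (Lam et al. 2010) |
| W03 | W03 | Wild | China, Inner Mongolia | 6.01 | (Lam et al. 2010) |
| W04 | W04 | Wild | China, Henan | 4.35 | (Lam et al. 2010) |
| W05 | W05 | Wild | China, Henan | 9.44 | (Lam et al. 2010) |
| W06 | W06 | Wild | China, Heilongjiang | 2.03 | (Lam et al. 2010) |
| W07 | W07 | Wild | China, Liaoning | 6.16 | (Lam et al. 2010) |
| W08 | W08 | Wild | China, Heilongjiang | 5.75 | (Lam et al. 2010) |
| W09 | W09 | Wild | China, Liaoning | 2.59 | (Lam et al. 2010) |
| W10 | W10 | Wild | China, Heilongjiang | 5.52 | (Lam et al. 2010) |
| W11 | W11 | Wild | China, Shanxi | 5.58 | (Lam et al. 2010) |
| W12 | W12 | Wild | China, Anhui | 9.09 | (Lam et al. 2010) |
| W13 | W13 | Wild | China, Inner Mongolia | 9.13 | (Lam et al. 2010) |
| W14 | W14 | Wild | China, Inner Mongolia | 3.27 | (Lam et al. 2010) |
| W15 | W15 | Wild | China, Henan | 5.79 | (Lam et al. 2010) |
| W16 | W16 | Wild | China, Heilongjiang | 4.77 | (Lam et al. 2010) |
| W17 | W17 | Wild | China, Liaoning | 2.90 | (Li et al. 2013) |
| L1 | L01 | Landrace | China, Sichuan | 3.18 | (Li et al. 2013) |
| L2 | L02 | Landrace | China, Guangdong | 2.91 | (Li et al. 2013) |
| L3 | L03 | Landrace | China, Hunan | 2.91 | (Li et al. 2013) |
| L4 | L04 | Landrace | China, Shanxi | 2.83 | (Li et al. 2013) |
| L5 | L05 | Landrace | China, Henan | 2.89 | (Li et al. 2013) |
| L6 | L06 | Landrace | China, Jiangsu | 5.42 | (Li et al. 2013) |
| L7 | L07 | Landrace | China, Hebei | 3.14 | (Li et al. 2013) |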
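The data volumes in Table 2.1 can be converted into an approximate raw sequencing depth by dividing by the genome size. The sketch below assumes that "Gb" means gigabases of raw reads and uses a genome size of about 1.1 Gb; both are assumptions that this section does not state, and the function name is hypothetical.

```python
# Assumption: approximate soybean genome size in gigabases.
GENOME_SIZE_GB = 1.1


def fold_coverage(data_gb, genome_size_gb=GENOME_SIZE_GB):
    """Raw sequencing depth: bases sequenced divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb / genome_size_gb


# A few rows copied from Table 2.1: new name -> (category, data in Gb).
samples = {
    "A01": ("Elite", 28.6),
    "E8": ("Elite", 2.48),
    "W05": ("Wild", 9.44),
    "W06": ("Wild", 2.03),
    "L06": ("Landrace", 5.42),
}

depths = {name: fold_coverage(gb) for name, (_, gb) in samples.items()}

# Print samples from the shallowest to the deepest.
for name, depth in sorted(depths.items(), key=lambda item: item[1]):
    print(f"{name}\t{samples[name][0]}\t{depth:.1f}x")
```

Because depth varies by more than tenfold across accessions, genotype calls in low-coverage samples are less reliable, which is worth keeping in mind when populations are compared.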
