Tomato (Solanum lycopersicum) is one of the most economically important vegetable crops worldwide. It is a rich source of micronutrients in the human diet and a model species for fruit quality. The investigation of tomato genetic resources is crucial for evolutionary and genetic studies and for tomato breeding.
From the late 18th century through the 19th and early 20th centuries, extensive crossing and selection took place in Europe, giving rise to a rich collection of tomato landraces (Bai et al. 2007; Grandillo et al. 2011). In particular, extensive selection work was performed in Italy by farmers in Campania, who developed several varieties adapted to local conditions and with quality requirements well defined for specific uses. Among them, the S. Marzano and Vesuviano varieties, grown in the rich volcanic soil surrounding Vesuvius, are considered important models (milestones) for fruit quality parameters. Indeed, these varieties have been used by scientists and breeders to obtain modern tomato varieties (Bai et al. 2007).
The advent of the genomics era has brought a substantial increase in the generation of data, knowledge and tools that can be employed in applied research. Candidate genes for important traits can be identified, and exploring functional nucleotide polymorphisms within genes of interest can help breeders combine favourable alleles. The decoding of the complete 'Heinz 1706' tomato reference genome will allow a better understanding of the genetic basis of agronomic traits for developing novel genotypes (Alba 2012 …).
Genome sequences and genomics tools offer exciting new perspectives and opportunities to track rates of sequence divergence over time, and provide hints about how genes evolve and generate new products by re-organization and shuffling of genomic sequences. Variant catalogues, however, will remain incomplete if some forms of variation are undocumented. Good genome coverage is required to improve variant detection and accuracy and to study the distribution of polymorphisms across genomes. Sequence sub-assembly approaches are very promising tools for improving genome reconstruction. In these approaches, contigs are assembled independently as far as possible and then joined using a reference genome. Sequences can be compared with each other with or without a reference genome, capturing a broader spectrum of sequence variation than mapping methods (Gan et al. 2011; Bevan, Nature 477).
Here we describe the generation and analysis of the S. Marzano and Vesuviano tomato genome sequences. First, we reconstructed the genomes using an innovative iterative assembly approach. Then, we documented the variation discovered, describing the distribution of variants between genotypes, assessing molecular features and exploring the quantitative and qualitative impact of functional variants in genes related to fruit quality. Finally, we illustrate how the sequences can be used to investigate the molecular origins of phenotypic variation.
We sequenced the San Marzano (SM) and Vesuviano (RSV) tomato varieties using (?) Illumina 100 bp paired-end reads with an insert size of about 250 bp. We obtained 177,758,218 reads for SM and 155,751,012 reads for RSV which, considering the size of about 760 Mb of the 'Heinz 1706' reference genome (The Tomato Genome Consortium, Nature 2012), correspond to an average expected depth of about 45.48x and 39.85x, respectively (Supplemental Table S1).
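The expected depth quoted above follows from dividing the total number of sequenced bases by the genome size. The following is a minimal sketch of that arithmetic, not the authors' script. It assumes that the reported read counts are read pairs (two 100 bp reads each) and uses a genome size of exactly 760 Mb; the function name is illustrative. Because the text gives the genome size only as "about 760 Mb", the printed values come out slightly higher than the reported ones.

```python
def expected_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average expected depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


# Assumption: "about 760 Mb" is taken as exactly 760,000,000 bp.
GENOME_SIZE = 760_000_000
READ_LENGTH = 100

# Assumption: the reported counts are read pairs, so each contributes 2 reads.
read_pairs = {"SM": 177_758_218, "RSV": 155_751_012}

for variety, n_pairs in read_pairs.items():
    depth = expected_depth(2 * n_pairs, READ_LENGTH, GENOME_SIZE)
    print(f"{variety}: {depth:.1f}x")
```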
We chose a genome reconstruction method based on a combination of iterative read mapping against the tomato reference genome and de novo assembly (Gan et al., Nature 2011; Supplemental Figure 1), since a potential advantage of such an approach over a single-pass alignment is the ability to describe complex loci. At the same time, this method is less demanding in terms of sequencing depth than a complete de novo assembly and does not require multiple libraries of different insert sizes. The size of the assembled genomes is very similar to that of the reference genome, corresponding to 99.8% of it (Supplemental Table 2). The slightly smaller size of the reconstructed genomes may be related to the low efficiency of the method in detecting long insertions (Gan et al., Nature 2011). We aligned the reads to the final assemblies to detect regions with low read coverage, which may correspond to complex polymorphisms. The average N50 length of contiguous regions between polymorphic regions was 72.7 and 77.5 kb for SM and RSV, respectively, while polymorphic regions had a maximum size of 88.7 kb and an average size of 1.4 kb (Supplemental Table 3). Variant positions are reported relative to the reference genome, allowing variants to be compared consistently between the two varieties and with the annotations available for the reference genome. These polymorphic regions overlap 368 genes in the RSV variety and 328 genes in SM; 850 of the genes affected by polymorphic regions are common to the two varieties.
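For readers unfamiliar with the N50 statistic used above, the sketch below shows the standard definition: the length L such that blocks of length L or longer account for at least half of the total length. The block lengths are hypothetical toy values, not data from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths.

    Blocks are visited from longest to shortest; the N50 is the length of
    the block at which the running total first reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical lengths (in kb) of contiguous regions between polymorphic regions.
toy_blocks = [80, 70, 50, 40, 30, 20]
print(n50(toy_blocks))
```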
We detected 177,179 and 206,867 single-base variants relative to the reference genome for SM and RSV, respectively (Table 1). A small fraction (3.3% on average) of the variants was ambiguous and most probably corresponded to heterozygous variants or to misalignments due to repeated sequences. In fact, 61% (RSV) and 63% (SM) of the putative heterozygous variants were located in annotated repeats and are most probably artifacts. We identified a fairly large number of indels or unbalanced substitutions (209,390 in total across both varieties). Indel sizes ranged from a single base up to 6,011 bp for insertions and 36,162 bp for deletions. While the majority of the indels were shorter than 6 bases, we detected 97 insertions longer than 100 bp in RSV, affecting 62 genes, and 91 in SM, affecting 60 genes. However, while SNPs were mostly specific to each cultivar, most of the insertions (71.4% of RSV insertions; 73.3% of SM insertions) and deletions (54.3% of RSV deletions; 56.6% of SM deletions) detected in each variety were shared with the other genotype and occurred with an average frequency of 1 indel every 6 kb. This frequency closely resembles the estimated indel error rate reported for the reference genome (1 every 6.4 kb; The Tomato Genome Consortium, Nature 2012) and suggests that common indels may be due to errors in the reference genome rather than to true indels.
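The comparison of shared and variety-specific variants described above can be illustrated with simple set operations. This is a minimal sketch under stated assumptions: each variant is keyed by a (chromosome, reference position, reference allele, alternative allele) tuple, and the records are hypothetical toy data rather than the study's variant calls.

```python
def partition_variants(variants_a, variants_b):
    """Split two variant collections into (only_a, only_b, shared)."""
    a, b = set(variants_a), set(variants_b)
    return a - b, b - a, a & b


def percent_shared(variants, shared):
    """Percentage of a variety's variants that are also found in the other."""
    variants = set(variants)
    return 100 * len(shared) / len(variants) if variants else 0.0


# Hypothetical variants: (chromosome, reference position, ref allele, alt allele).
sm = [("ch01", 1200, "A", "AT"), ("ch01", 5300, "GC", "G"), ("ch02", 880, "T", "C")]
rsv = [("ch01", 1200, "A", "AT"), ("ch01", 5300, "GC", "G"), ("ch03", 410, "G", "A")]

only_sm, only_rsv, shared = partition_variants(sm, rsv)
print(len(only_sm), len(only_rsv), len(shared))
print(f"{percent_shared(sm, shared):.1f}% of SM variants are shared with RSV")
```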
We took advantage of the existing high-quality Solanum lycopersicum reference annotation (ITAG2.3) released by the International Tomato Genome Sequencing Consortium (The Tomato Genome Consortium, 2012) to annotate the assembled genomes of the SM and RSV cultivars. The original annotations were transferred taking into account the cumulative effect of insertions and deletions along the whole length of the chromosomes (Methods). To evaluate the reliability of the transferred annotations, we analyzed the potential effect of the variants detected in each variety when projected onto the corresponding protein-coding sequences. Most mutations were located outside genic loci, with only a small fraction of genes harboring SNPs or indels inside their coding sequences (Tables 2, 3 and 4). In particular, we found that 32,128 RSV annotations and 32,399 SM annotations were not affected by mutations in the CDS, while 1,934 RSV and 1,707 SM transferred annotations were affected by mutations potentially causing amino acid substitutions of unknown effect on protein function (Figure 1). In total, these annotations corresponded to 98.1% (RSV) and 98.2% (SM) of all annotations and were considered reliably transferred. A small number of annotations (?) was predicted to have an altered gene structure due to mutations in splice sites and was classified as "transferred with putative altered structure". Moreover, 606 RSV and 565 SM genes, corresponding to 1.7% and 1.6% of the total annotations, respectively, were predicted to be potentially affected by disrupting mutations, such as frameshifts and alterations of the start or stop codon, and could not be reliably transferred (Figure 1 and Supplementary Table 5).
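The idea of transferring annotations while accounting for the cumulative effect of indels can be sketched as a coordinate shift: each reference position is moved by the net length of all insertions and deletions located upstream of it. The class below is a hypothetical illustration of that idea, not the procedure implemented in the pipeline used here. It assumes indels are given as (reference position, length change) pairs on one chromosome, that an indel affects only coordinates strictly downstream of its position, and it ignores features that fall inside a deletion.

```python
from bisect import bisect_left
from itertools import accumulate


class OffsetMap:
    """Shift reference coordinates by the cumulative length change of upstream indels."""

    def __init__(self, indels):
        # indels: iterable of (ref_pos, delta); delta > 0 insertion, delta < 0 deletion
        ordered = sorted(indels)
        self.positions = [pos for pos, _ in ordered]
        self.cumulative = list(accumulate(delta for _, delta in ordered))

    def shift(self, ref_pos):
        # k = number of indels located strictly upstream of ref_pos
        k = bisect_left(self.positions, ref_pos)
        return ref_pos + (self.cumulative[k - 1] if k else 0)

    def transfer(self, start, end):
        """Transfer a feature (start, end) from reference to assembly coordinates."""
        return self.shift(start), self.shift(end)


# Hypothetical indels: +3 bp at 100, -10 bp at 250, +1 bp at 400.
offsets = OffsetMap([(100, 3), (250, -10), (400, 1)])
print(offsets.transfer(150, 300))
```

A production implementation would also need to handle features overlapping an indel, which is where the "altered structure" and "not reliably transferred" categories described above would arise.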
Analysis of genetic variants in fruit quality related genes
The analysis of genetic differences between the SM and RSV genomes and the reference tomato genome SL2.40 focused on four gene classes related to fruit quality (ascorbate biosynthesis; carotenoid pathway; ethylene-related genes; cell wall related genes); transcription factors and transcription regulators potentially involved in the fruit ripening process were also included (Table 5). A high percentage of genes belonging to all investigated classes (83.5% for RSV and 80.7% for SM) showed variants in the annotated gene datasets. The percentage of varied genes ranged between 76.1% (MEP/carotenoid pathway) and 92.3% (ethylene-related genes) in the Vesuviano genome and between 73.9% (MEP/carotenoid pathway) and 87.1% (ripening differentially expressed transcription regulators) in the San Marzano genome. The total number of varied genes is not indicative of the specificity of variants for RSV or SM genes. On average, 9 variants per gene were identified, ranging from 6 to 15 in SM and from 6 to 19 in RSV (Table 5???). More than 500 genes common to SM and RSV (78%) showed variants with respect to the reference genome. Vesuviano showed a higher percentage of specific polymorphisms (5.6%) than SM (2.9%). Interestingly, four genes involved in ethylene biosynthesis varied only in RSV (Table 6). Lists of genes varied in only one of the two varieties or in both of them are reported in Supplemental Table …6a and 6b, respectively. Interestingly, most variants are located in upstream and downstream regions, with a percentage (of the total variants belonging to each class) ranging from 23.60% to 61.80% for Vesuviano and from 23.75% to 61.25% for San Marzano (Figure 2). Most variants are predicted to have a modifier impact in all classes (ranging from 95 to 98.57% of variants for each class), followed by moderate (ranging from 0.62 to 3.75%), low (ranging from 0 to 1.78%) and high impact (ranging from 0 to 1.34%) (Supplemental Table 7).
The putative impact of variants was also evaluated, focusing on non-synonymous variants located in coding sequences. A low number of variants was identified in exons, with an average percentage of 3.8% for RSV (ranging between 1.89% and 6.74%) and 4.1% for SM (ranging between 2.01% and 6.25%). The number of non-synonymous variants found only in RSV or only in SM and the number of common variants are reported in Supplemental Table 8. Overall, out of a total of XXX genes analyzed belonging to the selected groups, 50 non-synonymous variations (%%%) in the coding sequence were predicted to be deleterious when translated as amino acid substitutions (Table 7). The transcription factor class showed the highest number of deleterious substitutions, with 5, 6, and 11 genes in SM, RSV, and both genotypes, respectively. Similarly, variations deleterious for protein function were observed in genes belonging to the cell wall and transcription regulator categories. For the class of transcription regulators involved in fruit ripening, only 3 genes showed a deleterious amino acid substitution; of these, two were in common and one was specific to the RSV genotype. Moreover, two genes in the ripening ethylene-related class showed the same deleterious variation in both genotypes. Finally, transcription factors involved in the fruit ripening process exhibiting deleterious changes were identified only in the SM genotype. Quantitative differences in significantly enriched classes were identified between common and variety-specific nonsynonymous coding variants. Transcription factor nonsynonymous variants common to both tomato varieties were enriched for three classes: interleukin-6 receptor binding (GO:0005138), cytokine activity (GO:0005125) and RNA polymerase II transcription elongation factor activity (GO:0016944) (Figure 3a).
Nonsynonymous variants in transcription regulators gave significant differences in the enrichment analysis when all variants (common plus variety-specific) were compared with common variants only. Vesuviano and San Marzano variants significantly enriched for GO terms regarding molecular functions and biological processes were found. All Vesuviano transcription regulation variants showed enrichment in the molecular function GO class ethylene binding (GO:0051740), due to the presence of the gene encoding the ethylene receptor (Solyc07g056580, a variant absent in San Marzano). All Vesuviano transcription regulation nonsynonymous variants showed enrichment in the tight junction class (GO:0005923) because of the presence of the SNF2 helicase gene (Solyc03g095680, a variant absent in San Marzano). Figure 3b reports the enriched classes among common and variety-specific nonsynonymous cell wall coding variants. Common nonsynonymous variants showed enrichment in molecular function GO terms corresponding to hydrolase activity, hydrolyzing O-glycosyl compounds (GO:0004553), galactosidase activity (GO:0015925), coniferin beta-glucosidase activity (GO:0047782) and beta-galactosidase activity (GO:0004565). Vesuviano-specific nonsynonymous variants were enriched in fucosyltransferase activity (GO:0008417) and polygalacturonate 4-alpha-galacturonosyltransferase activity (GO:0047262). Indeed, the presence of nonsynonymous variants in the fucosyltransferase 7 gene (Solyc03g115830) and in the glycosyltransferase gene (Solyc07g055930) determined a private significant enrichment in those functional classes when compared with common variants and with all variants in San Marzano (common variants plus San Marzano-specific variants).
Materials and Methods
DNA library preparation was carried out according to the Illumina TruSeq DNA sample preparation protocol. A total of 2.5 µg of genomic DNA was sonicated with a Covaris S2 instrument to obtain 400 bp fragments; an end-repair step was then carried out and an "A" nucleotide was added to the 3' blunt ends before ligating multiple indexing adapters to the DNA fragments. The adapter-ligated DNA fragments were size-selected from gel, and DNA fragments with adapters on both ends were selectively enriched using 10 cycles of PCR. Quality control of the libraries was performed using the High Sensitivity DNA Kit (Agilent, Wokingham, UK), and accurate quantification was carried out by qPCR with the KAPA Library Quantification kit (KapaBiosystems, USA). Libraries were then pooled and sequenced on an Illumina HiSeq 1000 sequencer (Illumina Inc., San Diego, CA, USA), applying standard Illumina protocols with the TruSeq SBS Kit v3-HS and TruSeq PE Cluster Kit v3-cBot-HS, and 100-bp paired-end sequences were generated.
The assembly was performed with the IMR/Denom ver. 0.3.3 pipeline [Gan, X. et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477, 419-23 (2011).] using default parameters and the SL2.40 tomato genome [The Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635-41 (2012).] as a reference. Iterative mapping was performed using Stampy ver. 1.0.17 [Lunter, G. & Goodson, M. Stampy: A statistical algorithm for sensitive and fast mapping.
Sequence variation data were filtered and analyzed with command-line UNIX tools, including awk, sort and join, to retrieve the data subsets of interest; shell scripts were used when necessary. The filtered data allowed the specific comparison of genetic variants in the gene classes of interest and the extraction of the gene lists needed for the present analysis.
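The filter-and-join logic described above was carried out with awk, sort and join; the scripts themselves are not reported. As a hedged illustration only, the same kind of operation can be sketched in Python: variant records are filtered by predicted effect and joined to a gene-to-class table, and the matches are counted per class. Gene identifiers, effect labels and class names below are hypothetical.

```python
from collections import Counter


def count_variants_per_class(variants, gene_classes, effect=None):
    """Count variants per gene class.

    variants: iterable of (gene_id, effect) records.
    gene_classes: dict mapping gene_id to class name (the "join" table).
    effect: if given, keep only variants with this effect (the "filter" step).
    """
    counts = Counter()
    for gene_id, variant_effect in variants:
        if effect is not None and variant_effect != effect:
            continue  # filter step, comparable to an awk condition
        gene_class = gene_classes.get(gene_id)  # join step on the gene identifier
        if gene_class is not None:
            counts[gene_class] += 1
    return counts


# Hypothetical inputs for illustration only.
gene_classes = {"geneA": "cell wall", "geneB": "cell wall", "geneC": "ethylene"}
variants = [
    ("geneA", "nonsynonymous"),
    ("geneA", "upstream"),
    ("geneB", "nonsynonymous"),
    ("geneC", "nonsynonymous"),
    ("geneD", "nonsynonymous"),  # gene not in any class of interest
]

print(count_variants_per_class(variants, gene_classes, effect="nonsynonymous"))
```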
Command-line UNIX tools were also used to identify variants specific to the SM or RSV tomatoes. We identified the variant genes belonging to specific classes, extracted the non-synonymous variants in those genes, quantified the variants in each genetic region and related the variants to their predicted phenotypic impact. Variants with a high impact on gene function were also submitted to PROVEAN (Protein Variation Effect Analyzer) analysis. PROVEAN is an algorithm that predicts the functional impact of all classes of protein sequence variations, including single amino acid substitutions as well as insertions, deletions and multiple substitutions (Yongwook Choi, Gregory E. Sims, Sean Murphy, Jason R. Miller, Agnes P. Chan, "Predicting the Functional Effect of Amino Acid Substitutions and Indels", PLoS ONE, October 2012, Volume 7, Issue 10, e46688).
Our attention was focused on nonsynonymous SNPs located in genes belonging to the aforementioned classes. To evaluate whether specific metabolic pathways were significantly enriched, an enrichment analysis based on Gene Ontology (GO) term classification (The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat. Genet. May 2000;25(1):25-9.) was performed. We attempted to associate a GO term with each gene containing a nonsynonymous coding variation; a median of 88.3% of variants could be associated with GO terms for both varieties (some genes currently lack a GO term association). The data sets obtained were compared with the entire set of tomato genes with GO annotation (SOL Genomics. http://solgenomics.net/).
We performed a singular enrichment analysis (SEA) (Zhou Du, Xin Zhou, Yi Ling, Zhenhai Zhang, and Zhen Su. agriGO: a GO analysis toolkit for the agricultural community. Nucl. Acids Res. 38: W64-W70 (2010), DOI 10.1093/nar/gkq310), which tests annotation terms one at a time against a list of interesting genes (Tipney H, Hunter L. An introduction to effective use of enrichment analysis software. Hum Genomics. 2010 Feb;4(3):202-6.).
We used a hypergeometric test to compare each class with the reference background of genes. The Hochberg (FDR) statistical correction was applied and a significance level of 0.05 was set. The minimum number of mapping entries was set to 1 so that every significant enrichment could be observed. The classes of genes submitted to enrichment analysis were transcription factors, transcription regulators and cell wall, since they were numerous enough to be statistically analyzed by an enrichment test.
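The statistical test described above can be sketched as follows. This is a minimal standard-library illustration, not the agriGO implementation: it computes the upper-tail hypergeometric probability for one GO term and then adjusts a list of p-values, assuming that "Hochberg (FDR)" refers to the Benjamini-Hochberg procedure. Function names and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, annotated, sample):
    """P(X >= k): probability of drawing at least k annotated genes.

    population: background genes with GO annotation
    annotated:  background genes carrying the term under test
    sample:     genes in the list of interest
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, sample)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 10 background genes, 4 carry the term, 3 genes of interest, 2 hits.
p = hypergeom_sf(2, population=10, annotated=4, sample=3)
print(round(p, 4))

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

With a minimum of one mapping entry, as set above, terms supported by a single gene are also tested, so the multiple-testing correction matters for interpreting such terms.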
The SM and RSV genomes are, to our knowledge, the first crop genomes to be sequenced and assembled using a patchwork strategy (sub-assembling mapping), demonstrating that this approach can be used to obtain a highly contiguous assembly of a large genome. The catalogue of tomato genetic variants produced here enlarges the list available. The number of variants found is much larger than in earlier catalogues, which were based on transcriptome sequencing, BAC sequencing and oligonucleotide arrays (Hamilton et al. 2012, The Plant Genome 5:17-29; Sim et al. 2012, PLoS ONE e45520, doi:10.1371/journal.pone.0045520; Sim et al. 2011, Heredity; Blanca 2012, PLoS ONE; …). In addition, our catalogue includes other types of sequence polymorphism that have previously been difficult to assess on a genome-wide scale.
The chromosome pseudomolecules obtained allowed genome colinearity, which is useful for gene mapping and marker-assisted breeding, to be studied with high accuracy. At 40X sequence coverage, we estimated that approximately 99% of the tomato genome could be genotyped. Our analysis yielded approximately 200,000 SNPs and more than 130,000 indels. A small fraction, approximately 3% of the novel variants, likely reflects errors in the reference genome (given that the accuracy of the reference genome is 99.99%), and our false discovery rate (FDR) for SNP detection is approximately …% (…). We found variation in the level of polymorphism among chromosomes …. Indeed, the chromosome variation could reflect selection history rather than polymorphism discovery (Sim 2012). Genome-wide structural and gene content variations are hypothesized to drive important phenotypic variation within a species (MacHale 2012, Plant Phys. 159 …). However, in most cases the presence/absence of a deletion is common to both varieties, and thus we suspect that some percentage of the de novo assembled sequence represents sequence missing from the reference genome.
Based on the tomato gene model set, a limited number of unique (different structure???) genes was detected in each variety, while 1,934 RSV and 1,707 SM transferred annotations were affected by mutations potentially causing amino acid substitutions of unknown effect on protein function. A subset of these SNPs was restricted to a single variety. Studying the distribution of variants across the genomes of the sequenced varieties is important. The short evolutionary timescale of selection within naturally occurring populations …
Functional annotation analysis of genetic variants in quality-related genes showed that classes were differentially affected by genetic variants, suggesting varying degrees of selection for the genetic variants underlying biological processes. We also showed that the molecular nature of sequence variants and their position relative to genes influence the likelihood that they are functional. Functional variants with small effects are significantly more likely to be intergenic; by contrast, larger effects are more likely to be caused by intronic/exonic variants. SNPs within the Vesuviano and San Marzano classes reflect the genetic diversity between these varieties. Four genes involved in ethylene biosynthesis varied only in RSV, which is a long-storage tomato variety with extended shelf life. Since ethylene is involved in fruit maturation, the polymorphisms detected in these RSV genes should be further explored to validate their actual involvement in extending shelf life.
Analysis of molecular function Gene Ontology (GO) associations demonstrated that the gene categories in the assembled genomes of both varieties are similar. Genes involved in cell wall … contained most of the genetic variation, primarily SNPs (…) and …. The observed enrichment may be related to the selection of Vesuviano for long storage. The textural properties of fruits can contribute to an extended shelf life (Siracusa, J. Agric. Food Chem., 2012, 60).
The genetic variants identified here represent a significant addition to the catalogue currently available for tomato studies exploring genomic relationships. The genome sequences reported here and our variant catalogue will be useful for identifying the molecular basis of complex gene patterns. Further analyses and functional studies will serve as a basis for understanding trait differences, which will facilitate the identification of markers for genomic marker-assisted breeding. Genotyping of local genomes is useful for understanding the genomic features that distinguish modern from traditional varieties. It will also improve the utility of the tomato as a model for organoleptic quality. In addition, the genes we identified that are related to flavour perception could be used as markers in breeding tomatoes for organoleptic quality, or they may be potential targets for genetic or non-genetic manipulation. Collectively, the sequences we describe here will help dissect the path from sequence variant to phenotype. Variants specific to SM and RSV might be explored through a targeted resequencing approach in order to verify whether they represent variants characteristic of these two different tomato types.