Normalization Of Spectral Counts Biology Essay

Normalization of spectral counts in label-free shotgun proteomic approaches is important for obtaining reliable quantification. Two spectral count normalization methods, total spectral count (TSpC) normalization and normalization to specific proteins, were evaluated using spectral count data obtained from conidia of the rice blast fungus Magnaporthe oryzae. Myoglobin (equine) and Ovalbumin (chicken) were spiked into the protein mixture as control proteins. TSpC normalization gave the best correlation and the lowest variance. Additionally, the effect of sample complexity on the number of proteins identified was examined by pooling some of the adjacent GeLC fractions and analyzing them on the same column on the nanoLC-Orbitrap.

The fungus Magnaporthe oryzae (M. oryzae) causes rice blast disease, destroying millions of hectares of rice each year and causing losses valued at billions of dollars. Mann and coworkers were the first to combine two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) fractionation with matrix-assisted laser desorption ionization time-of-flight mass spectrometry, identifying 150 yeast proteins in their 1996 study. Twelve years later, Mann and coworkers again identified 3639 yeast proteins using 1D-PAGE and 3987 yeast proteins using OFFGEL, both coupled to online liquid chromatography (LC) with a linear trap quadrupole (LTQ)-Orbitrap mass spectrometer.

A variety of analytical strategies exist for fractionating complex mixtures prior to MS analysis. Fractionation is performed at either the protein or the peptide level, each with its own advantages and disadvantages. Aside from electrophoresis-based fractionation methods (2D-PAGE, 1D-PAGE, OFFGEL and GelFree), chromatography-based methods have also been developed. Offline strong cation exchange (SCX) prior to nanoLC-MS and multidimensional protein identification technology (MudPIT), which combines SCX and reversed phase (RP) online, are both common methods in proteomics.

RP-RP at two different pH values has become an option among separation techniques, as it was found to be identical in orthogonality to SCX-RP. The orthogonality in RP-RP arises only from the difference in pH, since C18 columns are used in both dimensions. In this study we performed peptide-level offline RP high-pressure liquid chromatography (HPLC) fractionation of an S. cerevisiae whole digest at micro-flow rates prior to RP nanoLC-MS (LC-LC-MS), using the same mobile phase and pH, and compared this method with nanoLC-MS analysis of the whole digest alone. The total numbers of proteins and protein groups identified in each analysis were compared.

2. Experimental Section

2.1 Sample preparation

M. oryzae conidia were harvested from 8-day-old minimal medium plates. Three biological replicates, each containing 2 million conidia, were created and pooled to account for biological variance. Conidia were lysed by bead beating in a buffer of 1X PBS, 2 M urea and 0.1% SDS. Protein concentration was determined by a BCA assay. Samples 1 and 2 were prepared and processed on different days. Myoglobin (equine) (Sigma Aldrich, St. Louis, MO) and Ovalbumin (chicken) (Sigma Aldrich, St. Louis, MO) were chosen as spike-in proteins, and 25 ng of each was added to 50 µg of protein for each sample. The samples were loaded onto 10-20% 1D-SDS PAGE gels. After Coomassie blue staining (Bio-Rad, Hercules, CA), 10 bands were excised and in-gel digestion was performed on each fraction according to protocol.

2.3 NanoLC-MS

The nanoLC-1D system from Eksigent (Dublin, CA) was operated at room temperature. A 75 μm i.d. IntegraFrit capillary (New Objective, Woburn, MA) trap was packed in house to 3 cm with Magic C18AQ packing material (Michrom BioResources, Auburn, CA). A 75 μm i.d. PicoFrit capillary column (New Objective, Woburn, MA) was packed to 15 cm with the same packing material. Separation was carried out using a continuous, vented column configuration as previously reported by our group[24]. A 2 μl (200 ng) sample was injected into a 10 μl loop and loaded onto the trap with approximately ten washes prior to analytical separation. The flow rate was set to 350 nl/min for separation on the analytical column. A 5 minute column wash was performed at 2% B, followed by a 1 hour linear gradient of 10-40% B. The gradient was then ramped up to 90% B in 1 minute and maintained for 10 minutes. Two minutes were required to re-establish 2% B, which was maintained for two minutes. Three technical replicates of each sample were run.

A hybrid LTQ-Orbitrap MS (Thermo Fisher Scientific, Bremen, Germany) was used to perform MS analysis. The automatic gain control (AGC) limit for the Fourier Transform MS (FTMS) was set to 1 × 10^6. The maximum injection time was 500 ms for the FTMS and 80 ms for the ion trap. The AGC limit for the ion trap was 8 × 10^3. The resolving power was set to 60,000 FWHM at m/z 400, and 8 data-dependent MS/MS events were performed for ions with charge states ≥+2. An exclusion time of 3 minutes was applied to ions selected once. The normalized collision energy was 35%. Lock mass calibration using polydimethylcyclosiloxane present in ambient laboratory air (m/z 445.120025) was enabled. Additionally, external calibration was performed following the manufacturer's instructions and using the manufacturer's calibration mix.

2.4 Data analysis

Data analysis was performed by searching the .raw files against the target-reverse M. oryzae database (MG8_GeneCall10.fasta) from the Broad Institute using MASCOT Distiller version 2.3.01 (Matrix Science Inc., Boston, MA). The MASCOT parameters were described in our previous paper. Final analysis at 1% FDR was performed in ProteoIQ version 2.1.01_SILAC_beta08 (BioInquire, Athens, GA).
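The 1% FDR cutoff rests on the target-reverse search: matches to reversed (decoy) sequences serve as an estimate of how many target matches are false. The following is a minimal sketch of that idea, assuming the simple estimate FDR = decoy hits / target hits above a score threshold. The exact calculation used by ProteoIQ is not described here, and the function names and example scores are hypothetical.

```python
def fdr_at_threshold(hits, threshold):
    """Estimate FDR as decoy hits / target hits among hits scoring >= threshold.

    `hits` is a list of (score, is_decoy) tuples (a hypothetical input format).
    """
    targets = sum(1 for score, is_decoy in hits if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits if score >= threshold and is_decoy)
    if targets == 0:
        # No target hits: the estimate is undefined if decoys remain, else zero.
        return float("inf") if decoys else 0.0
    return decoys / targets


def lowest_threshold_for_fdr(hits, max_fdr=0.01):
    """Return the most permissive score threshold whose estimated FDR <= max_fdr."""
    candidates = sorted({score for score, _ in hits})
    passing = [t for t in candidates if fdr_at_threshold(hits, t) <= max_fdr]
    return min(passing) if passing else None


# Hypothetical search results: (score, matched a reversed sequence?)
example_hits = [(90, False), (80, False), (70, False), (60, True), (50, False)]
cutoff = lowest_threshold_for_fdr(example_hits, max_fdr=0.01)
```

In this toy list the single decoy hit at score 60 forces the cutoff above it, so only the three top-scoring target hits are retained.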

3. Results & Discussion

The experimental workflow is diagrammed in Figure 1. Sample 1 and sample 2 were prepared and processed on different days. M. oryzae conidia protein spiked with Myoglobin (equine) and Ovalbumin (chicken) was loaded onto 1D SDS PAGE gels and in-gel digestion was performed. Ten fractions of each sample were analyzed by nanoLC-MS using a different column and trap for each sample. Additionally, 1/3 of the adjacent fractions of sample 2 were combined to give sample 2', which was also analyzed by nanoLC-MS.

Sample 2 resulted in 32500 TSpC and 1477 proteins identified (see Figure 2A and B, and Tables 1 and 2). Increasing the sample complexity, as was the case for sample 2', resulted in fewer identifications. The TSpC for sample 2' was 16209, about half of the TSpC of sample 2 (see Figure 2B and Table 2), which was expected given the smaller number of fractions (10 fractions for sample 2 and only 5 fractions for sample 2'). Only 1087 proteins were identified from sample 2' on its own. When combined with sample 2 in ProteoIQ, however, the number of proteins identified for sample 2' increased to 1392. All proteins identified from sample 2' were within the total population of protein identifications from sample 2. The number of identifications for sample 2 increased as well, by 114 proteins. Combining the data with another sample increases the confidence in the acquired spectra and thus increases the number of proteins identified. The same occurred when samples 1 and 2 were combined. Sample 1 yielded 1185 identified proteins on its own and 1478 proteins when combined with sample 2. The number of proteins identified for sample 2 increased to 1523. Sample 1 contained 13 proteins and sample 2 contained 58 proteins that were not shared. The difference in protein identifications between samples 1 and 2 was expected because of sample handling on different days, reagent quality, gel-to-gel variance, and the use of a different column and trap.
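The counts quoted above can be cross-checked with a few lines of arithmetic. This is only a bookkeeping sketch: the function name is hypothetical, the shared count of 1465 is taken from the Figure 2 caption, and the union is derived here rather than reported in the text.

```python
def venn_counts(total_a, total_b, shared):
    """Return (unique to A, unique to B, size of union) for two protein lists."""
    if shared > min(total_a, total_b):
        raise ValueError("shared count cannot exceed either total")
    unique_a = total_a - shared
    unique_b = total_b - shared
    return unique_a, unique_b, unique_a + unique_b + shared


# Combined analysis: sample 1 (1478 proteins), sample 2 (1523 proteins),
# 1465 proteins shared (Figure 2 caption).
unique_1, unique_2, union = venn_counts(1478, 1523, 1465)

# Ratio of total spectral counts, sample 2' over sample 2.
tspc_ratio = 16209 / 32500
```

The unique counts reproduce the 13 and 58 unshared proteins, and the TSpC ratio comes out just under 0.5, in line with halving the number of fractions.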

The SpC scatter for the proteins is plotted in Figure 3A and Figure 3B. Ideally, a slope close to 1 was expected for samples 1 and 2, and a slope close to 0.5 for samples 2 and 2'. Normalization brings the slope of the unnormalized scatter plots close to 1. In both cases, TSpC normalization gives a better correlation than normalization to the spike proteins Ovalbumin and Myoglobin. The normalization factors can be found in Tables 1-4.
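The two normalization schemes can be illustrated with the counts quoted in this essay for samples 2 and 2'. The sketch below assumes the normalization factor is a plain ratio (reference count divided by sample count) applied to every protein's SpC; the exact formula behind Tables 1-4 is not given in the text, so both that form and the function names are assumptions.

```python
def normalization_factor(reference_count, sample_count):
    """Factor that scales `sample_count` to match `reference_count` (assumed form)."""
    if sample_count <= 0:
        raise ValueError("sample count must be positive")
    return reference_count / sample_count


def normalize(spectral_counts, factor):
    """Apply one factor to every protein's spectral count, giving NSpC values."""
    return {protein: spc * factor for protein, spc in spectral_counts.items()}


# Scale sample 2' to sample 2 using counts quoted in the text and captions.
tspc_factor = normalization_factor(32500, 16209)  # total spectral counts
ovalbumin_factor = normalization_factor(88, 45)   # Ovalbumin SpC
myoglobin_factor = normalization_factor(21, 20)   # Myoglobin SpC

# Spike-in counts of sample 2' after TSpC normalization.
nspc_sample_2_prime = normalize({"Myoglobin": 20, "Ovalbumin": 45}, tspc_factor)
```

Under this assumed form, the TSpC and Ovalbumin factors both come out near 2, whereas the Myoglobin factor stays near 1, so the choice of a single spike protein can change the normalized counts considerably.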

A closer look at the spike proteins shows that Myoglobin was identified with ca. 20 SpC in each sample and that Ovalbumin was identified with 63 SpC in sample 1, 88 SpC in sample 2 and 45 SpC in sample 2' (Figure 4A, Figure 4B, Table 5A-C, Table 6A-C). Sequence coverage of 54.55% for Myoglobin and 43.78% for Ovalbumin was observed (Figure 5A and Figure 5B). Despite the good sequence coverage and the use of two spike proteins that reflect different protein properties, TSpC normalization gives a higher correlation for the NSpC scatter plots and also reduces the variance the most, as shown in Figure 6.
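Variance between samples can be expressed as a percent standard deviation of the normalized counts for each protein. The sketch below assumes the sample standard deviation divided by the mean; whether the plotted values use the sample or the population form is not stated, and the replicate counts are hypothetical.

```python
from statistics import mean, stdev


def percent_std_dev(normalized_counts):
    """Return 100 * sample standard deviation / mean for one protein's NSpC values."""
    if len(normalized_counts) < 2:
        raise ValueError("need at least two values")
    average = mean(normalized_counts)
    if average == 0:
        raise ValueError("mean is zero; percent standard deviation is undefined")
    return 100 * stdev(normalized_counts) / average


# Hypothetical NSpC values for one protein across three runs.
spread = percent_std_dev([8, 10, 12])
```

A normalization method that lowers this value across most proteins is the one that reduces variance best in the sense used above.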

The quantitation vs. sampling plots in Figure 7A and Figure 7B exhibit a relatively narrow distribution for samples 1 and 2 and, as expected, an even narrower distribution for samples 2 and 2'. A few proteins were observed outside the +1 and -1 log2 limits, which normally indicate up- and down-regulation in label-free quantification studies. Dividing the proteins into 3 categories (5-50 TSpC, 50-100 TSpC and ≥100 TSpC) reveals that the falsely down-regulated proteins identified for sample 1 are all in the range of 5-50 TSpC (Figure 8A-C). The distribution narrows as the TSpC numbers increase. It follows that changes for proteins with higher TSpC numbers are more difficult to detect using the 2-fold change limits, and that different models need to be developed to address this issue.
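The 2-fold change test and the three count categories can be sketched as follows. Two assumptions are made that the text does not spell out: a protein's TSpC is taken as the sum of its counts in the two samples being compared, and the category boundaries are half-open (50 falls into 50-100, 100 into ≥100). The function names are hypothetical.

```python
import math


def log2_ratio(spc_a, spc_b):
    """log2 of the spectral count ratio; both counts must be positive."""
    if spc_a <= 0 or spc_b <= 0:
        raise ValueError("spectral counts must be positive")
    return math.log2(spc_a / spc_b)


def classify(spc_a, spc_b, limit=1.0):
    """Flag a protein as up, down or unchanged using +/- `limit` on the log2 scale."""
    ratio = log2_ratio(spc_a, spc_b)
    if ratio > limit:
        return "up"
    if ratio < -limit:
        return "down"
    return "unchanged"


def tspc_category(total_spc):
    """Assign a protein to one of the count categories (assumed boundaries)."""
    if total_spc < 5:
        return "below 5"
    if total_spc < 50:
        return "5-50"
    if total_spc < 100:
        return "50-100"
    return ">=100"


# Hypothetical protein with 10 SpC in one sample and 40 SpC in the other.
call = classify(10, 40)
category = tspc_category(10 + 40)
```

Grouping the calls by category makes it possible to see whether proteins flagged by the fixed ±1 limit concentrate in the low-count range.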

4. Conclusions

TSpC normalization and normalization to specific proteins for label-free spectral counting data were investigated using M. oryzae as an example. Normalization to TSpC gave the best correlation and the lowest variance for all data sets. We have also shown that combining different samples from the same organism yields more protein identifications, and that it is not necessarily required to have as many fractions as would be needed when analyzing a single sample.

The application of 2-fold change limits to identify up- and down-regulation in the proteome, as is done in label-free quantitation, is satisfactory for proteins with TSpC numbers from 5 to 50, but better models have to be developed for higher TSpCs.


The authors would like to thank the National Science Foundation (Grant Number MCB-0918611), the W. M. Keck Foundation, and North Carolina State University for financial support.

Figure Captions

Figure 1 - Experimental workflow. The samples were prepared and processed on two different days. Twenty-five ng Myoglobin and 25 ng Ovalbumin were added to 50 µg of M. oryzae conidia protein. One-dimensional SDS-PAGE separation and in-gel digestion were performed. Ten fractions from day 1 and 10 fractions from day 2 were analyzed in triplicate by nanoLC-MS using different traps and columns for each sample set. Additionally, adjacent fractions from the day 2 sample were pooled and also analyzed in triplicate by nanoLC-MS. The data were searched in MASCOT and compiled in ProteoIQ.

Figure 2 - Venn diagrams showing the total number of proteins identified at 1% FDR in each sample, both when the samples were handled separately and when the data sets were combined. Sample 2 yielded the highest number of identified proteins with 1477. (a) Combining sample 1 and 2 resulted in 1465 shared proteins. (b) Combining sample 2 and 2' yielded 1392 shared proteins, where all proteins in sample 2' were already identified in sample 2.

Figure 3 - Unnormalized SpC scatter for proteins with SpC between 0 and 100. (a) Sample 2 vs. sample 1. (b) Sample 2 vs. sample 2'. A narrower distribution is observed for (b) because both samples have the same origin and differ only in complexity. Myoglobin has 21 SpC in sample 1 and sample 2. In sample 2' it is identified by 20 SpC. Ovalbumin has 63 SpC in sample 1, 88 SpC in sample 2 and only 45 SpC in sample 2'.

Figure 4 - Sequence coverage for (a) Myoglobin and (b) Ovalbumin.

Figure 5 - Unnormalized SpC data and average NSpCs plotted against the NSpCs for each protein for (a) sample 1 and sample 2 and (b) sample 2 and sample 2'. Total spectral count normalization shows a higher correlation than normalization to the spike-in proteins.

Figure 6 - Quantitation/sampling correlation plots (a) for sample 1 and 2 and (b) for sample 2 and 2'. A narrower distribution is observed for sample 2 and 2'.

Figure 7 - Quantitation/sampling correlation plots for sample 1. (a) For proteins with total spectral count numbers between 5 and 50 a relatively broad distribution is observed. (b) Proteins with spectral counts between 50 and 100 have a narrower distribution. (c) Proteins with spectral counts higher than 100 have the narrowest distribution.

Figure 8 - Average number of NSpC plotted against % Std deviation for sample 1 and 2. TSpC normalization, normalization to both of the spike-in proteins Myoglobin and Ovalbumin, and normalization to each of the spike-in proteins individually were performed. The lowest variance is observed for TSpC normalization.