A Binomial Probability Distribution Model Based Protein Biology Essay


ABSTRACT: Mass spectrometry has become one of the most important technologies in proteomic analysis. Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is a major tool for the analysis of peptide mixtures from protein samples. The key step of MS data processing is the identification of peptides from sample fragmentation spectra by searching public sequence databases. Although a number of algorithms to identify peptides from MS/MS data have already been proposed, e.g. Sequest, OMSSA, X!Tandem, Mascot and MassWiz, they are mainly based on statistical models that consider only peak matches between experimental and theoretical spectra, not peak intensity information. Moreover, different algorithms give different results from the same MS data, implying incompleteness and low stability. We developed a novel peptide identification algorithm, ProVerB, based on a binomial probability distribution model of protein tandem mass spectrometry combined with a new scoring function that makes full use of peak intensity information and thus enhances the ability of identification. Compared with Mascot, Sequest and SQID, ProVerB identified significantly more peptides from LC-MS/MS datasets at 1% False Discovery Rate (FDR) and provided more confident peptide identifications. ProVerB is also compatible with various platforms and experimental datasets, showing its robustness and versatility. The open-source program ProVerB is available at http://bioinformatics.jnu.edu.cn/software/proverb/.

KEYWORDS: Protein Identification Algorithm, Tandem Mass Spectrometry, Statistical Model


Soft ionization techniques, e.g. Matrix-Assisted Laser Desorption Ionization (MALDI) 1 and Electrospray Ionization (ESI) 2, are able to maintain the integrity of peptides, thus empowering mass spectrometry (MS) methods to perform proteomic analysis 3-5. Protein identification is the most fundamental step in the data processing pipeline, since the sensitivity and accuracy of the identification algorithm are crucial for the downstream analyses. Generally, a peptide identification algorithm selects some peaks from the spectra, evaluates the similarity between the experimental and theoretical spectra, and then assigns the best match within the peptide error window as the result 8. The scoring models that evaluate the similarity between experimental and theoretical spectra should consider three aspects: the number of peak matches, the number of consecutive peak matches and the intensities of the matched peaks 9.

A number of peptide identification algorithms based on various concepts are available for MS data, e.g. Mascot 10, Sequest 11, OMSSA 12, X!Tandem 13, MassWiz 14, Andromeda 15 and SQID 9. Mascot and Sequest are widely used commercial software packages and commonly adopted search tools in protein identification 15; however, only limited details of these algorithms have been released. Mascot is based on a probability model, whereas Sequest is based on an empirical scoring model that computes the cross-correlation between experimental and theoretical spectra. Mascot selects the highest peak in each 14 Da mass interval and keeps the peaks with intensities above a threshold. Sequest takes consecutive ion matches and intensity information into account; it preprocesses the spectrum by keeping the top 200 peaks and separates the spectrum into ten bins for normalization 15. X!Tandem uses a hypergeometric scoring model, while OMSSA is based on a Poisson scoring model to assess the significance of a peptide match. Both select the 50 most intense peaks by default. MassWiz divides the spectrum into 10 parts and selects the 20 highest peaks from each part. SQID 9 keeps the top 80 peaks after deleting parent-related peaks.

However, none of these algorithms accurately uses all of the information in MS experiments. They share similar methods to generate theoretical spectra. Considering six types of ions (b, y, b-H2O, b-NH3, y-H2O and y-NH3) in CID (Collision-Induced Dissociation) fragmentation mode, theoretical peak intensities are set to three artificial values: 50 (b and y ions), 25 (b and y ions that have lost H2O or NH3) and 10 (a ions). Such a theoretical spectrum does not fully reflect the experimental characteristics of mass spectrometry 11. Therefore, once the peaks are selected, these algorithms do not use the peak intensity information obtained in the experiment when comparing the experimental and theoretical spectra. SQID introduces the intensity probability of pair-wise amino acid fragments to account for the intensity match quality 9, but most identification algorithms based on statistical models rely only on peak matches between experimental and theoretical spectra and do not use peak intensity information. The incomplete use of MS information compromises the sensitivity, robustness and confidence of these methods.

To make full use of the MS information and to maximize universality, we present here a novel identification algorithm, Protein Verification algorithm based on Binomial probability distribution (ProVerB), to enhance the accuracy, completeness and robustness of peptide identification. We tested ProVerB against other algorithms using multiple MS datasets. At 1% FDR, ProVerB identified peptides with higher sensitivity and confidence, significantly and consistently outperforming the widely used Mascot and Sequest.


2.1 Cell culture, protein extraction and trypsin digestion

Streptococcus pneumoniae D39 was cultivated in Todd-Hewitt broth with 0.5% yeast extract (THY) in a controlled incubator (37 °C, 5% CO2). Cells were harvested at OD600 ~ 0.6 by centrifugation at 5000 × g for 20 min at 4 °C. The harvested cells were washed three times with prechilled PBS (10 mM, pH 7.4) and then resuspended in lysis buffer (15 mM Tris-HCl, pH 8.0) 18. The mixture was freeze-thawed for three cycles and then sonicated 10 times for 30 sec each. The lysate was centrifuged at 12000 × g for 10 min at 4 °C. Protein concentrations were determined using the Bradford assay, and the proteins were reduced with 10 mM DTT (37 °C, 3 h) and alkylated with 20 mM iodoacetamide (room temperature, 1 h in the dark). Proteins were precipitated with four volumes of ice-cold acetone, pelleted by centrifugation and washed twice with ethanol. The pellet was resuspended in 25 mM Tris-HCl buffer (pH 7.6) and digested with sequencing grade modified trypsin (1:25 w/w; Promega, Madison, WI) at 37 °C for 20 h 19.

2.2 SCX-RPLC-MS/MS analysis

Dried peptides were reconstituted in 5% ACN/0.1% formic acid and analyzed with a Finnigan Surveyor HPLC system coupled online with an LTQ-Orbitrap XL (Thermo Fisher Scientific, Waltham, MA) equipped with a nanospray source. The peptide mixtures were loaded onto an SCX column and then eluted with 0, 0.05, 0.2 and 1 M NH4Cl. Each fraction was loaded onto a C18 column (100 μm i.d., 10 cm length, 5 μm resin; Michrom Bioresources, Auburn, CA) using an autosampler. Peptides were eluted with a 0~35% gradient (Buffer A, 0.1% formic acid and 5% ACN; Buffer B, 0.1% formic acid and 95% ACN) over 120 min and analyzed online with the LTQ-Orbitrap MS using a data-dependent TOP10 method 20. The parameters used for the mass spectrometric analysis were: spray voltage, 1.85 kV; no sheath and auxiliary gas flow; ion transfer tube temperature, 200 °C; normalized collision energy for MS2, 35%; ion selection threshold for MS2, 1000 counts; activation q = 0.25 and activation time of 30 ms during MS2 acquisitions. The mass spectrometer was operated in positive ion mode with a data-dependent automatic switch between MS and MS/MS acquisition modes 19.

2.3 Mass spectrometry datasets

The datasets (Mix 3) of standard mixtures of 18 proteins, obtained with five types of instruments (Agilent XCT, Thermofinnigan LTQ-FT, Thermofinnigan LCQ DECA, Thermofinnigan LTQ and Micromass/Waters QTOF Ultima, abbreviated below as Agilent, FT, LCQ, LTQ and QTOF, respectively), were downloaded (https://regis-web.systemsbiology.net//PublicDatasets/) to test the accuracy and dynamic range of the algorithms. The LTQ-Orbitrap data obtained from the S. pneumoniae D39 protein identification, containing more than 270,000 spectra, served as the training dataset for the parameters of the model. The dataset of the E. coli proteome 23 was downloaded from http://marcottelab.org/MSdata/Data_03/.

2.4 Data preprocessing

For the S. pneumoniae D39 and E. coli datasets, the raw format files were converted to dta file format by Bioworks 3.31 (Thermo Finnigan, San Jose, CA) and the dta format files were merged into Mascot generic format (mgf) using the merge.pl program (http://www.matrixscience.com/downloads/merge.zip). For the 18 proteins dataset, the downloaded dta format files were merged into Mascot generic format (mgf) by the merge.pl program. The dta format files were the input files of our method and of the Sequest software.

2.5 MS/MS database search

For target-decoy based FDR calculation, the forward and reverse databases were built for the three datasets as in Table 1.
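As an illustration of the target-decoy setup described above, the following minimal sketch builds a combined forward and reverse database from FASTA text. The parsing helper, the `REV_` header prefix and the function names are illustrative assumptions; the source does not describe how ProVerB's databases were actually generated.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def add_reversed_decoys(records, prefix="REV_"):
    """Return the forward entries followed by one reversed decoy per entry.

    The prefix is a hypothetical naming convention used only in this sketch.
    """
    decoys = [(prefix + header, seq[::-1]) for header, seq in records]
    return records + decoys


# Toy input with two made-up protein entries.
forward = read_fasta(">P1 example\nMKTAYIAK\nQR\n>P2 example\nGASP\n")
combined = add_reversed_decoys(forward)
```

Searching the combined list lets hits to the `REV_` entries serve as an estimate of how many forward hits are false at a given score threshold.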

Table 1. The databases used for MS/MS database search

|                              | S. pneumoniae D39 database | 18 proteins database | E. coli database |
| Protein sequences            | (missing)                  | (missing)            | (missing)        |
| Forward and reverse database | (missing)                  | (missing)            | (missing)        |

The Mascot generic format (mgf) files were searched using Mascot 2.3 (Matrix Science, London, UK) against the forward and reverse database. The dta files were searched using Sequest 28.13 (Thermo Fisher Scientific, Waltham, MA) and our algorithm ProVerB. The following search criteria were applied for all three algorithms: full tryptic specificity; two missed cleavages were allowed; cysteine (+57.021464 Da, Carbamidomethylation) was set as fixed modification, whereas methionine (+15.994915 Da, Oxidation) was considered as variable modification. The values of precursor ion mass tolerance and fragment ion mass tolerance were set as in Table 2 based on the instrument characteristics. The fragment ion tolerance of Sequest was set to 1.0 Da since it requires an integer value for m/z in the preprocessing of MS data 11.
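Because the tolerance settings mix absolute (Da) and relative (ppm) units, a small helper can make the comparison explicit. This is a sketch only; the function name and signature are assumptions and are not taken from ProVerB, Mascot or Sequest.

```python
# Modification masses quoted in the search criteria above.
CARBAMIDOMETHYL_C = 57.021464   # fixed modification on cysteine, in Da
OXIDATION_M = 15.994915         # variable modification on methionine, in Da


def within_tolerance(observed, theoretical, tolerance, unit="Da"):
    """Return True if two mass values agree within the given tolerance.

    unit="Da" treats the tolerance as an absolute difference;
    unit="ppm" scales it by the theoretical value (parts per million).
    """
    delta = abs(observed - theoretical)
    if unit == "Da":
        return delta <= tolerance
    if unit == "ppm":
        return delta <= theoretical * tolerance * 1e-6
    raise ValueError("unit must be 'Da' or 'ppm'")
```

For example, at a theoretical value of 1000, a 10 ppm tolerance corresponds to an absolute window of 0.01, which is far narrower than a 0.5 Da fragment tolerance.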

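Table 2. The parameters of precursor and fragment ion tolerance settings

| Instrument or dataset | ProVerB and Mascot: precursor ion tolerance | ProVerB and Mascot: fragment ion tolerance | Sequest: precursor ion tolerance | Sequest: fragment ion tolerance |
| (label missing)       | 2.0 Da                                      | 0.5 Da                                     | 3.0 Da                           | 1.0 Da                          |
| (label missing)       | 3.0 Da                                      | 0.5 Da                                     | 3.0 Da                           | 1.0 Da                          |
| (label missing)       | 3.0 Da                                      | 0.5 Da                                     | 3.0 Da                           | 1.0 Da                          |
| (label missing)       | 10 ppm                                      | 0.5 Da                                     | 10 ppm                           | 1.0 Da                          |
| (label missing)       | 0.2 Da                                      | 0.2 Da                                     | 10 ppm                           | 1.0 Da                          |
| (label missing)       | 10 ppm                                      | 10 ppm                                     | 10 ppm                           | 1.0 Da                          |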

2.6 False discovery rate (FDR)

The peptide spectrum matches (PSMs) were extracted from the Mascot data format file (.dat) with our in-house Matlab program, and the PSMs with the highest rank were exported to calculate the FDR threshold. Sequest results were extracted from the Sequest output files (.out), and the PSMs with the highest rank and ∆Cn ≥ 0.1 were exported to calculate the FDR threshold. ProVerB results and the extracted results of Mascot and Sequest were written to csv format files. All target and decoy scores of rank 1 PSMs were sorted in ascending order to calculate their FDR values by Käll's method. Different score thresholds were then evaluated to obtain the corresponding FDR.

The score threshold was tuned to reach FDR ≤ 1%. The scoring functions differ among search algorithms: for Mascot, the ion scores of peptides with length ≥ 6 were sorted to calculate FDR; for Sequest and SQID, the Xcorr scores were sorted separately for each precursor ion charge, for peptides with length ≥ 6 and ∆Cn ≥ 0.1 (Sequest) or ∆Cn ≥ 0.05 (SQID); for ProVerB, the S scores (the final score of each peptide, see below) of peptides with length ≥ 6 were sorted to calculate FDR.
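The FDR formula itself is not preserved in this copy. The sketch below therefore assumes the common target-decoy estimate, FDR(t) = (number of decoy PSMs scoring at least t) / (number of target PSMs scoring at least t), and scans for the lowest score threshold that satisfies FDR ≤ 1%. The function name and the exact estimator are assumptions, not the authors' implementation.

```python
def lowest_score_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Return the smallest observed score whose estimated FDR is <= max_fdr.

    Assumed estimator: decoys above threshold / targets above threshold.
    Returns None if no threshold with at least one target PSM qualifies.
    """
    for threshold in sorted(set(target_scores) | set(decoy_scores)):
        targets = sum(s >= threshold for s in target_scores)
        decoys = sum(s >= threshold for s in decoy_scores)
        if targets > 0 and decoys / targets <= max_fdr:
            return threshold
    return None


# Toy rank-1 PSM scores (made-up numbers).
toy_targets = [10, 20, 30, 40, 50]
toy_decoys = [5, 12]
toy_cutoff = lowest_score_threshold(toy_targets, toy_decoys)
```

With these toy scores, every threshold up to 12 still admits a decoy, so the first threshold that reaches zero estimated FDR is 20.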

2.8 Comparison of algorithms

All algorithms were compared according to the number of identified MS/MS spectra and unique peptides at FDR ≤ 0.01. The overlap ratios of unique peptides and MS/MS spectra were further analyzed across the identification results of the three algorithms.


3.1 Peak selection in the spectra

Peaks within 1 ± 0.25 Da of each other were considered isotope peaks and were filtered out 9. The algorithms minimize the number of peaks used for the spectrum search in order to reduce random matches and enhance accuracy. Sequest selects the 200 highest peaks from each fragment spectrum 11. Mascot selects the highest peak in every 14 Da interval and keeps it for subsequent analysis if it is above a certain threshold 10. A maximum of 50 peaks is used by X!Tandem 13. Many other algorithms select the 1~10 highest ion peaks from each 100 Da window for subsequent analysis 26-28. Our algorithm ProVerB selects the top 6 ion peaks in each 100 Da window, since we consider the matching of six types of fragment ions, namely b, y, b-H2O, y-H2O, b-NH3 and y-NH3. The fragment ions are selected only if their intensities are higher than 33% of the highest peak.
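The peak selection rule can be sketched as follows. The text does not say whether "the highest peak" means the highest peak of the whole spectrum or of the 100 Da window; this sketch assumes the window, and that choice, the function name and the parameter names are all assumptions. Isotope filtering is not included.

```python
from collections import defaultdict


def select_peaks(peaks, window=100.0, top_n=6, min_rel_intensity=0.33):
    """Keep the top_n most intense peaks in each m/z window.

    peaks: iterable of (mz, intensity) tuples.
    A kept peak must also exceed min_rel_intensity times the highest
    intensity in its window (assumed reading of the 33% rule).
    Returns the selected peaks sorted by m/z.
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // window)].append((mz, intensity))

    selected = []
    for members in bins.values():
        members.sort(key=lambda p: p[1], reverse=True)
        base = members[0][1]  # highest intensity in this window
        selected.extend(
            p for p in members[:top_n] if p[1] > min_rel_intensity * base
        )
    return sorted(selected)


# Toy spectrum: eight peaks in the 100-200 window, two in the 200-300 window.
toy_spectrum = [
    (101.0, 100), (102.0, 90), (103.0, 80), (104.0, 70),
    (105.0, 60), (106.0, 50), (107.0, 40), (108.0, 10),
    (250.0, 5), (260.0, 1),
]
toy_selected = select_peaks(toy_spectrum)
```

Selecting six peaks per 100 Da is what gives the random match probability p0 = 0.06 used later in the scoring function.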

3.2 Theoretical spectra

A theoretical spectrum was generated based on the chemistry of b/y ion fragmentation. If a b or y fragment ion contained S, T, E or D residues, a neutral loss of H2O (b-H2O or y-H2O) was considered; if a b or y fragment ion contained R, K, Q or N residues, a neutral loss of NH3 (b-NH3 or y-NH3) was considered 15. If the parent ion charge was +1 or +2, we considered +1/+2 fragment ion peaks. +2 fragment ion peaks were considered only when the parent ion charge was at least 2 and the fragment ion contained one of the R, K or H residues 9.
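The neutral-loss rules above can be illustrated with a short sketch that generates singly charged b and y ions for an unmodified peptide. The residue masses are standard monoisotopic values; the function name and the label format are assumptions, and doubly charged fragments and modifications (such as carbamidomethylated cysteine) are left out for brevity.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER, AMMONIA, PROTON = 18.010565, 17.026549, 1.007276


def theoretical_ions(peptide):
    """Map ion labels such as 'b2' or 'y3-H2O' to singly charged m/z values."""
    ions = {}
    n = len(peptide)
    for i in range(1, n):
        fragments = (
            ("b", peptide[:i], 0.0),        # N-terminal fragment
            ("y", peptide[n - i:], WATER),  # C-terminal fragment keeps H2O
        )
        for series, frag, offset in fragments:
            mz = sum(MONO[aa] for aa in frag) + offset + PROTON
            label = f"{series}{i}"
            ions[label] = mz
            if set(frag) & set("STED"):     # water loss rule from the text
                ions[label + "-H2O"] = mz - WATER
            if set(frag) & set("RKQN"):     # ammonia loss rule from the text
                ions[label + "-NH3"] = mz - AMMONIA
    return ions


toy_ions = theoretical_ions("GASK")
```

For the toy peptide GASK, the b3 ion (GAS) contains serine and therefore gets a b3-H2O partner, while b1 (G) gets no neutral-loss peaks.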

3.3 Scoring function

The scoring function is the critical part of an MS peptide identification algorithm. In our algorithm we applied the binomial probability density function to consider three aspects: simple fragment ion matches, consecutive fragment ion matches and the intensity of the b/y ion peaks.

3.3.1 The scoring function for simple fragment matches

It is difficult to propose a universal scoring function that fits the various types of instruments and strategies, the variability in the fragmentation patterns, and the extent of fragmentation and the intensities of the peaks. We addressed this problem by establishing a binomial distribution statistical model based on the nature of matching itself, independent of all the experimental factors listed above. The match probability of experimental and theoretical fragment ions reflects the confidence of the match:


p = probability of a random match.

p0 = 0.06. From each 100 Da interval we selected the 6 highest peaks; therefore the random match probability is 0.06.

f = ratio between the number of selected peaks in the spectrum and the m/z range of the experimental mass spectrum.

n = number of theoretical fragment peaks.

k = number of matched peaks in the experimental spectrum.

P = probability that k of the n theoretical peaks are matched, calculated by the binomial distribution probability density function.
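The equations for this section are not preserved in this copy. From the definitions, P is a binomial probability, which in its standard form reads P = C(n, k) · p^k · (1 − p)^(n − k). The sketch below evaluates that standard form; how p is obtained from p0 and f cannot be recovered from the text, so p is passed in directly, and the function names are assumptions.

```python
from math import comb, log10


def binomial_match_probability(n, k, p):
    """Probability that exactly k of n theoretical peaks match at random.

    Standard binomial form: C(n, k) * p**k * (1 - p)**(n - k).
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def match_score(n, k, p):
    """Express the probability on the -10*log10 scale used for the final score."""
    return -10.0 * log10(binomial_match_probability(n, k, p))


# Example: 12 of 20 theoretical peaks matched, using p = 0.06 for illustration.
example_P = binomial_match_probability(20, 12, 0.06)
```

The more matches observed relative to what random matching would produce, the smaller P becomes and the larger the score on the -10·lg scale.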

3.3.2 The scoring function for consecutive ion matches

Multiple consecutive ion matches were converted into a series of ion pairs matches: x consecutive ion matches were converted into x-1 ion pairs, and the matching probability of each pair was calculated as above. For example, if b1, b2 and b3 ions were consecutively matched, this consecutive ion match was converted into two consecutive pairs: b1-b2 and b2-b3. Additionally, the probability of consecutive fragment matches was calculated as follows:


p1 = probability of the consecutive fragment matches

P1 = probability that k1 of the n1 consecutive theoretical peak pairs are matched consecutively, calculated by the binomial distribution probability density function

n1 = number of the consecutive matches in the theoretical spectrum

k1 = number of the consecutive matches in the experimental spectrum

r is the background constant. By training on the large number of identification results from the S. pneumoniae D39 dataset, we derived r = 0.09083 using the following formula:

It reflects the probability of actual consecutive matching. A background value is needed to correct for consecutive matches of more than two ions. Nevertheless, the probability of consecutive matches of three ions was far lower than that of two ions, resulting in a small r value.
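The conversion of consecutive matches into ion pairs can be sketched as below: a run of x consecutively matched ions yields x − 1 pairs. The function name is an assumption, and the sketch covers only the pair counting (the k1 term). The formulas linking p1 and r are not preserved in this copy and are not reproduced here.

```python
def count_consecutive_pairs(matched_positions):
    """Count adjacent pairs (i, i + 1) among the matched ions of one series.

    matched_positions: iterable of ion indices, e.g. [1, 2, 3] for b1, b2, b3.
    """
    matched = set(matched_positions)
    return sum(1 for i in matched if i + 1 in matched)


# b1, b2 and b3 matched consecutively give the pairs b1-b2 and b2-b3.
pairs_b = count_consecutive_pairs([1, 2, 3])
```

Runs are handled independently, so matches at positions 1, 2, 4, 5 and 6 contribute one pair from the first run and two from the second.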

3.3.3 The scoring function for spectrum intensity of b/y ion peaks

Another novelty of our algorithm is that it considers peak intensity quantitatively for identification. The peak intensities of b/y ions generated from the same peptide are correlated based on their physical and chemical properties 9. This provides important additional information to filter the noise and increase the sensitivity of identification. We introduced matrices Bij and Yij based on the chemical properties of the bonds between each amino acid pair (AAP). The matrices Bij and Yij were calculated using the S. pneumoniae D39 dataset and are listed in Supplementary Table 1.



M_I = the number of AAP b-ions or y-ions matches of the highest two peaks in every 100 Da.

M_E = the AAP b-ions or y-ion matching number of the top six peaks in every 100 Da.

i and j stand for amino acids, ranging from 1 to 20.

Peptide score function is defined as:


k2 = number of the peaks matching b/y-ions

n2 = number of b/y-ions in theoretical spectra

T = the sum of Bij and Yij of the AAP b/y ion peaks which are the highest two peaks in every 100 Da and matched to amino acids i and j.

c = number of the highest two peaks matching b/y ions in every 100 Da.

f = ratio between the number of selected peaks and the m/z range of experimental mass spectrometry. A constant 0.02 is added since the random match probability of two ions in 100 Da interval is 0.02.

Here, p2 is the random match probability of b/y ions match concerning the peak intensity. indirectly reflects the peak intensity match quality of b/y ions and T should be greater than c.

A detailed example is included in the supplementary materials.

3.3.4 The overall scoring function and background value

The three scores above were then used to calculate the overall peptide score PEP_S:

PEP_S = -10∙lg(P∙P1∙P2)

To investigate the influence of the P1 and P2, we plotted the peptide number against the FDR considering these three scores P, P1 and P2 progressively by applying three different scoring methods -10∙lg(P), -10∙lg(P∙P1), -10∙lg(P∙P1∙P2), in S. pneumoniae D39 dataset (Supplementary Figure 1). The curves showed that both the consecutive ion matches P1 and the intensity matches P2 contribute to the improvement of identification.

The peptide score can be affected by additional factors including peptide length, number of modifications, number of missed cleavages and charge of the precursor ion, which necessitates a correction 15. A background value B was subtracted from PEP_S:

S = PEP_S - B

The correction values for different classes of peptides were derived from the S. pneumoniae D39 dataset with the Bayesian learning method. A statistical probability of 0.5 for a given PEP_S in the Bayesian network means that the forward and reverse peptides cannot be distinguished; at this point we defined S = 0, so the background value B equals PEP_S. The background values B for the different classes of peptides are listed in Table 5. S is the final score of each peptide.
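The two formulas above, PEP_S = -10∙lg(P∙P1∙P2) and S = PEP_S - B, can be sketched directly. The function names are assumptions; summing the logarithms instead of multiplying the probabilities is a numerical choice made here to avoid underflow with very small probabilities.

```python
from math import log10


def peptide_score(P, P1, P2):
    """PEP_S = -10 * lg(P * P1 * P2), computed as a sum of logarithms."""
    return -10.0 * (log10(P) + log10(P1) + log10(P2))


def final_score(P, P1, P2, background):
    """S = PEP_S - B, where B is the background value for the peptide class."""
    return peptide_score(P, P1, P2) - background


# Illustrative probabilities and background value (made-up numbers).
example_S = final_score(1e-3, 1e-2, 1e-1, background=20.0)
```

A peptide whose PEP_S equals its background value receives S = 0, which corresponds to the point where forward and reverse peptides cannot be distinguished.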

Table 5. Background values learnt from Bayesian networks

| Background value type            | Background value           |
| Modification sites               | (missing)                  |
| Missed cleavage sites            | (missing)                  |
| Peptide length                   | precursor ion mass × 0.018 |
| Parent ion charge (charge state) | (missing)                  |

3.4 Comparison of ProVerB with Mascot, Sequest and SQID

3.4.1 Number of identified peptides and spectra

We compared the sensitivity of our algorithm ProVerB (Matlab version) with that of two widely used MS identification algorithms, Mascot and Sequest. The test datasets include the in-house generated S. pneumoniae D39 dataset, the E. coli dataset and the dataset from the 18 standard protein mixture.

Under the criterion FDR ≤ 0.01, all three algorithms were able to identify more than 3000 peptides from the S. pneumoniae D39 dataset (Fig. 1). The Venn diagram shows that most of the peptides (2702) and spectra (81243) could be identified by all three algorithms. The overlap ratios of identified peptides and spectra between Mascot and ProVerB were 91.0% and 97.9%, respectively, showing good consistency with the other algorithms. ProVerB identified more peptides and spectra than Mascot and Sequest. This advantage held in the three E. coli datasets as well (Figs. 2A and 2B). We also compared ProVerB with SQID, which also considers peak intensity information. Compared with the SQID result (3441 peptides and 96542 spectra), the overlap ratios of identified peptides and spectra between SQID and ProVerB were 84.6% and 87.3%, respectively. The comparison plot of peptide identification number versus FDR for the four algorithms showed that ProVerB identifies the most peptides within the FDR range of 0.5%~3% (Supplementary Figure 2).

Fig. 1. Comparison of Mascot, Sequest and ProVerB using S. pneumoniae D39 dataset. (A) Number of identified peptides. (B) Number of identified spectra.

Fig. 2. (A) The number of identified peptides from the E. coli datasets using ProVerB, Mascot and Sequest. (B) The number of identified spectra from the E. coli datasets using three algorithms.

Next, we tested the adaptability of ProVerB to various types of MS instruments, including Agilent, FT, LCQ, LTQ and QTOF, using the downloaded 18 standard protein MS spectra. Again, ProVerB identified significantly more peptides and spectra than Mascot (up to 45.7%) and Sequest (up to 41.7%) on all instruments except Agilent (Figs. 3A and 3B). These data indicate that ProVerB identified more peptides and spectra than the other two identification algorithms in most cases and that it is applicable to a wide variety of MS instruments.

We used the background value r = 0.09083 in all analyses above. However, the precursor ion charge and peptide length may influence the background value r slightly (Supplementary Figure 3). To assess how much the fluctuation of the r value influences the identification performance, we tested ProVerB using the two-dimensional r value matrix (Supplementary Table 2). In this case ProVerB identified only one more peptide than with the average r value, and 98.8% of the peptides overlapped between the two settings. Therefore, the precursor ion charge and peptide length have only a trivial influence, if any. The r values vary depending on the instrument type: Agilent, FT, LCQ, LTQ and QTOF give r values of 0.1261, 0.1475, 0.1328, 0.1236 and 0.09006, respectively. We tested ProVerB using r = 0.1475, the value that deviates most from the average r value, on the dataset generated by FT; this resulted in only one more identified peptide, and all the other identified peptides were the same. These results confirm that the average value r = 0.09083 can be used universally in ProVerB, largely insensitive to the precursor ion charge, peptide length and instrument type.

Fig. 3. (A) The number of identified peptides from the 18 standard protein dataset obtained from five types of MS instruments using three algorithms. (B) The number of identified spectra from the 18 standard protein dataset obtained from five types of MS instruments using three algorithms.

3.4.2 The number of identified high-confidence peptides

Since different algorithms give different identification results, a cross-check of the results from different algorithms may reveal the confidence of the identified peptides. The high-confidence peptides and spectra characterize the quality of identification of an algorithm 14. To calculate the number of high-confidence peptides, we first calculated the overlaps of the identified peptides for each pair of algorithms (Supplementary Table 3). The high-confidence peptides were then calculated from these pairwise overlaps, where A, B and C represent the identified peptides or spectra of ProVerB, Mascot and Sequest, respectively. The fractions of high-confidence peptides identified by these three algorithms are listed in Table 3.
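The exact expression for the high-confidence set is not preserved in this copy. A plausible reading of the pairwise-overlap description is that a peptide counts as high-confidence when at least two of the three algorithms report it; the sketch below implements that assumption with set operations. The function names and the per-algorithm fraction are illustrative, not the authors' definition.

```python
def high_confidence(a, b, c):
    """Items reported by at least two of the three result sets (assumed rule)."""
    a, b, c = set(a), set(b), set(c)
    return (a & b) | (a & c) | (b & c)


def high_confidence_fraction(own, high_conf):
    """Fraction of one algorithm's identifications that are high-confidence."""
    own = set(own)
    return len(own & set(high_conf)) / len(own) if own else 0.0


# Toy peptide sets for three algorithms (made-up identifiers).
set_a = {"pep1", "pep2", "pep3"}
set_b = {"pep2", "pep3", "pep4"}
set_c = {"pep1", "pep3", "pep5"}
toy_high_conf = high_confidence(set_a, set_b, set_c)
```

In the toy example, pep4 and pep5 are each reported by a single algorithm and are therefore excluded from the high-confidence set.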

Table 3: The fraction of high-confidence peptides of the three algorithms

Rows: D39 dataset; 18 standard proteins mixture; E. coli dataset (E. coli 1, E. coli 2, E. coli 3). The cell values are missing from this copy.

In most cases, ProVerB identified a larger fraction of high-confidence peptides than Mascot and Sequest, indicating robust identification power across instruments and datasets (Supplementary Fig. 4).

3.4.3 Correlation between ProVerB and Mascot scores

The scores in the MS identification algorithms quantitatively reflect the significance of the identification. We therefore compared the score values of ProVerB and Mascot using the S. pneumoniae D39 dataset (more than 270,000 spectra) (Fig. 4). The Pearson correlation coefficient reached 0.8124 (p < 10^-16), showing a good correlation between the two algorithms. This indicates that ProVerB provides a scoring scheme compatible with that of Mascot.

Fig. 4: The scatter plot of ProVerB and Mascot scores identifying the S. pneumoniae D39 dataset.

4. Conclusions

The boom in proteomics applications and the wide variety of mass spectrometry technologies used for peptide identification necessitate a versatile and accurate peptide identification algorithm. In this paper, we present a new algorithm, ProVerB, based on a novel binomial distribution statistical model, and validate its accuracy, robustness and compatibility. Additionally, ProVerB is an open-source program, so no algorithmic detail is hidden as in the commercial software packages. Users may tune the parameters according to their specific experimental setup to optimize the results. It can also be compiled on various operating systems and has a user-friendly graphical user interface. Although ProVerB does not support ECD/ETD mass spectrometry data, we believe that ProVerB will find broad application in proteomics studies and provide more robust and accurate results than the two commercial algorithms, producing a more solid base of data for the downstream analyses.


Author Contributions

Chuan-Le Xiao, Gong Zhang and Qing-Yu He conceived this project. The theoretical model of ProVerB was developed by Chuan-Le Xiao and Xiao-Zhou Chen. The algorithm was originally programmed by Yang-Li Du. The tests were carried out by Chuan-Le Xiao and Gong Zhang. The experimental part of the D39 dataset was accomplished by Xuesong Sun.

Funding Sources

This work was collectively supported by National "973" Projects of China (2011CB910700), National Natural Science Foundation of China (20871057, 31000373 and 31200612), the Fundamental Research Funds for the Central Universities (11610101 and 21611201), "211" Projects and the Pearl River Rising Star of Science and Technology of Guangzhou City (2011048b).


We are grateful to Shuai Liu and Chao Ma for their help with programming ProVerB and for technical hints on performance optimization.

Supporting Information

Three supplementary tables and supplementary notes that support this article are available free of charge via the Internet at http://pubs.acs.org. The ProVerB program, source code and test dataset can be downloaded at http://bioinformatics.jnu.edu.cn/software/proverb/.