Bioinformatic Prediction of Lipoproteins in Gram-positive Bacteria

Because signal peptides have well-established features, and because the lipobox contains an invariant cysteine, bioinformatic analysis of signal peptides can identify potential lipoproteins in Gram-positive bacteria.

Sequence analysis of signal peptides from Gram-positive and Gram-negative bacteria showed that lipoprotein signal peptides tend to be shorter than secretory signal peptides, which indicates that the c-region is shorter and contains apolar amino acids. The c-region therefore appears to be a continuation of the hydrophobic domain, an interpretation based primarily on the sequence conservation preceding the invariant lipid-modified cysteine.

Using signal peptide sequences containing the lipobox, the Prosite consensus pattern (PS00013) describing the sequence motif that determines lipidation was constructed as {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C to recognise bacterial lipoprotein sequences. In this pattern, the residues allowed at positions -1 to -4 relative to the cysteine are listed in square brackets, and {DERK} indicates that the charged residues D, E, R and K are excluded from the h-region. Two additional rules apply: the cysteine must lie between positions 15 and 35, and an arginine or lysine must occur within the first seven positions of the sequence, so that the pattern is placed in the correct orientation relative to the n-region characteristic of a signal peptide.
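The pattern and its two positional rules can be sketched as a regular expression plus two checks. This is only an illustrative translation, not the Prosite implementation: the function name ps00013_match and the test sequence are made up for the example, and positions are counted from 1 at the N-terminus.

```python
import re

# Regex form of {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C.
# The lookahead lets overlapping candidate sites be examined in turn.
LIPOBOX = re.compile(r"(?=([^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C))")


def ps00013_match(seq):
    """Return the 1-based position of the lipobox cysteine, or None.

    Hypothetical helper: applies the pattern and the two extra rules
    described in the text to a single protein sequence.
    """
    seq = seq.upper()
    # Rule: an arginine or lysine within the first seven positions.
    if not re.search(r"[KR]", seq[:7]):
        return None
    for m in LIPOBOX.finditer(seq):
        # The cysteine is the last residue of the matched stretch.
        cys_pos = m.start(1) + len(m.group(1))
        # Rule: the cysteine must lie between positions 15 and 35.
        if 15 <= cys_pos <= 35:
            return cys_pos
    return None


# Invented N-terminal sequence, used only to exercise the function.
print(ps00013_match("MKKIAVLLSALLLLTACSNQ"))  # 17
```

A sequence that lacks K or R in its first seven residues, or whose only matching cysteine falls outside positions 15 to 35, returns None.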

A large number of putative lipoproteins (Lpps) have been identified through molecular genetic studies, and quite a number of these could be false positives, because a cysteine can also occur within the signal peptide sequence of exported proteins or of proteins targeted for insertion into the plasma membrane. It was also noted that the stretch of amino acids preceding the invariant cysteine differs between the signal peptides of different bacterial taxa.

To support the prediction of lipoproteins by bioinformatic analysis, Sutcliffe and Harrington (2002) created a dataset of experimentally verified Gram-positive lipoproteins. These lipoproteins were identified by several approaches: (1) metabolic labelling with radiolabelled fatty acid (palmitate); (2) inhibition of Lsp (lipoprotein signal peptidase) using the antibiotic globomycin; (3) biochemical characterization of the purified protein; and (4) evidence that protein processing is disrupted by mutation in either Lgt or Lsp, or following site-directed mutagenesis to replace the lipobox cysteine. Using these criteria, along with an extensive review of the scientific literature, 33 proteins were identified as proven bacterial lipoproteins.

To further validate the 33 lipoproteins identified above, several other bioinformatic sequence analyses were performed. Bacterial Lpp sequences were obtained from the Prosite website, with searches restricted to Bacillus subtilis or S. pyogenes. Membrane-spanning domains (MSDs) in these sequences were predicted using the TMpred program, with the minimum length of the hydrophobic domain set to 14 aa, and the signal peptide sequences were analysed using SignalP 2.0 (refined hidden Markov model, version 2.0). For further clarification of the Lpp sequences, the TopPred2 (transmembrane predictor) and DAS programs were used.

To exclude false positives, the N-terminal sequences of the bacterial Lpps were analysed individually using TMpred and SignalP. Sequences that clearly lacked an MSD, or in which the most N-terminal MSD extended beyond the invariant cysteine, were considered possible false positives. TMpred alone was not sufficient, because the proven Lpps CatC and QoxA contain two additional MSDs beyond their N-terminal lipid anchors. To counter this, SignalP was used to analyse the sequences: bacterial Lpps in which signal peptide features were absent, and/or in which the lipobox sequence was internal to an h-region/MSD, were confirmed to be false positives. These sequences were further clarified using TopPred2 and DAS, and a consensus view was taken of the position of the invariant cysteine relative to the first predicted MSD.
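The exclusion criteria above can be written as a small decision rule. This is a simplified sketch that assumes the predictor outputs have already been parsed: the function name and its three inputs are hypothetical, and it does not call TMpred, SignalP, TopPred2 or DAS.

```python
from typing import Optional, Tuple


def likely_false_positive(has_signal_peptide: bool,
                          cys_pos: int,
                          first_msd: Optional[Tuple[int, int]]) -> bool:
    """Flag a candidate Lpp as a possible false positive.

    has_signal_peptide: hypothetical flag, e.g. derived from SignalP output.
    cys_pos: 1-based position of the candidate lipobox cysteine.
    first_msd: (start, end), 1-based inclusive, of the most N-terminal
        predicted hydrophobic span, or None if no span was predicted.
    """
    # No signal peptide features: excluded.
    if not has_signal_peptide:
        return True
    # No hydrophobic span at all: excluded.
    if first_msd is None:
        return True
    # The span runs past the cysteine, so the lipobox is internal to it
    # (or the span lies entirely downstream of the cysteine).
    _start, end = first_msd
    return end > cys_pos


# Invented coordinates, for illustration only.
print(likely_false_positive(True, 24, (6, 22)))   # False
print(likely_false_positive(True, 24, (10, 40)))  # True
```

Proven Lpps with further membrane-spanning domains downstream of the lipid anchor are not penalised by this rule, because only the most N-terminal span is inspected.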

Analysis of the signal peptide and lipobox features with these programs confirmed previous findings of a high frequency of leucine at the -3 position and of alanine or serine at the -2 position of the lipobox. Compared with the PS00013 pattern, there were clear deviations and restrictions: alanine and glycine were the only amino acids found at the -1 position, and two proven Lpps had no arginine or lysine in the first seven amino acids, which contradicts the PS00013 pattern.

Analysis of the lipobox sequences from the 33 experimentally verified lipoproteins showed that the n-regions had a mean length of 6.7 ± 3.5 aa (range 3-15 aa) and the h-regions a mean length of 12.1 ± 2.3 aa (range 6-20 aa). These values agree with the h-region features reported for the putative Lpps of B. subtilis. The mean position of the invariant cysteine was 24.0 ± 3.6 (range 17-33), which indicates that lipoprotein signal peptides are typically shorter than the signal peptides that direct protein export in Gram-positive bacteria. The mean combined length of the h- and c-regions was 17.1 aa, which is sufficient to span a typical bilayer membrane. These data suggest that the conserved residues are positioned at the outer face of the cytoplasmic membrane, where the Lgt enzyme interacts with the invariant cysteine in the lipobox.

Because the PS00013 pattern conflicts with certain proven Lpps in Gram-positive bacteria, and because differences in signal peptide features between bacterial taxa are likely to cause further mismatches, a modified pattern, G+LPP, was constructed to identify the 33 proven bacterial Lpps. In Prosite syntax, the G+LPP pattern is <[MV]-X(0,13)-[RK]-{DERKQ}(6,20)-[LIVMFESTAG]-[LVIAM]-[IVMSTAFG]-[AG]-C.
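To show how the Prosite syntax of the G+LPP pattern reads, the sketch below converts it to a Python regular expression. The converter prosite_to_regex is a hypothetical helper that handles only the elements used here (the < anchor, x wildcards, [..] allowed sets, {..} excluded sets and (n) or (n,m) repeats), and the two test sequences are invented.

```python
import re

GLPP = ("<[MV]-X(0,13)-[RK]-{DERKQ}(6,20)-"
        "[LIVMFESTAG]-[LVIAM]-[IVMSTAFG]-[AG]-C")


def prosite_to_regex(pattern):
    """Translate a subset of Prosite pattern syntax into a regex string."""
    out = []
    for element in pattern.replace(" ", "").split("-"):
        if element.startswith("<"):
            out.append("^")          # '<' anchors the match at the N-terminus
            element = element[1:]
        # Separate an optional repeat, e.g. (6) or (0,13), from the residues.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        residues, low, high = m.groups()
        if residues.lower() == "x":
            out.append(".")          # any residue
        elif residues.startswith("{"):
            out.append("[^" + residues[1:-1] + "]")  # excluded residues
        else:
            out.append(residues)     # [..] set or a single residue
        if low is not None:
            out.append("{" + low + ("," + high if high else "") + "}")
    return "".join(out)


GLPP_RE = re.compile(prosite_to_regex(GLPP))

# Invented sequences: the second has a glutamine in the hydrophobic stretch.
print(bool(GLPP_RE.search("MKKIAVLLSALLLLTACSNQ")))  # True
print(bool(GLPP_RE.search("MKKIAVLQSALLLLTACSNQ")))  # False
```

The second sequence is rejected because {DERKQ} excludes glutamine as well as the charged residues, which is one way in which G+LPP is stricter than a pattern that excludes only D, E, R and K.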

Compared with the PS00013 pattern, the G+LPP pattern was more stringent and discriminated better against false-positive bacterial Lpp sequences when tested on the B. subtilis genome. The PS00013 pattern search identified 103 putative Lpps, whereas the G+LPP pattern identified 61 probable Lpps together with 6 proven Lpps in that organism. The G+LPP pattern can therefore be used to predict bacterial Lpps with greater confidence.

Both the Prosite pattern and the G+LPP pattern were applied to the S. pyogenes genome, retrieved from the SWISS-PROT/TrEMBL database. The Prosite pattern search identified 36 sequences, of which 9 were excluded as unlikely Lpps, whereas the G+LPP pattern search identified 26 sequences, of which only one was considered an unlikely Lpp. Thus, the G+LPP pattern excluded 8 of the 9 unlikely Lpps identified by PS00013.

Both search patterns identified the previously proven Lpp LppC, as well as several other Lpps that had already been identified and proven. The 24 Lpps identified in the S. pyogenes genome by pattern searching represent 1.5% of the S. pyogenes proteome, which is comparable to the 36 Lpps identified in the S. pneumoniae genome.

Apart from the previously identified Lpps found by both patterns, some sequences were picked up as putative Lpps by one pattern but not the other. The PS00013 pattern search identified three putative Lpp sequences, Spy1972, Spy1361 and Spy2066, that were not identified by the G+LPP pattern. The n-region of the Spy1972 signal sequence is unusual in length, and the protein contains an LPXTG motif at its C-terminus. Spy1361 contains a glutamine residue within the h-region, and the signal sequence of Spy2006 is not clear. Because of these differences in their signal sequences, they were not picked up by the G+LPP pattern, and further evidence is needed to show that they are genuine bacterial Lpps. Likewise, the G+LPP pattern search identified one sequence, Spy0903, that was not found with the PS00013 pattern because of its extended n-region.

Bacterial Lpp signal sequences missed by both pattern searches were sought using a combination of strategies: analysis of the S. pyogenes genome annotation, homologue searches with pneumococcal Lpps, PEDANT searches and low-stringency BLAST searches. These searches identified six possible false-negative bacterial Lpps: Spy0163, Spy1592, Spy0778, Spy1306, Spy0457 and Spy2033.

Among these possible bacterial Lpps, four are substrate-binding proteins (SBPs). Spy0163 is a paralogue of Spy1228; after its n-region was refined using an alternative start methionine and lysine, it was accepted by the G+LPP pattern, and the motif was confirmed by Rosati et al. Spy1592 was excluded by the pattern search because it contains an asparagine at the -4 position of the signal sequence. However, the ORF in the S. pyogenes genome contains a serine at this position, which suggests that it is indeed an Lpp in some strains. Likewise, Spy0457, a peptidyl-prolyl isomerase of the cyclophilin family, contains an asparagine at the -4 position, but it is highly homologous to the pneumococcal Sp0771, which suggests that it may assist in the folding of exported proteins.

Both Spy0778 and Spy1306 were excluded by the pattern searches because they contain a proline at the -4 position, so further evidence is needed. Spy2033 has an abnormally long signal sequence of 64 aa but, intriguingly, its h-region ends at the invariant lipobox cysteine, which agrees with the general consensus for a typical Lpp signal peptide. An alternative start at M41 is consistent with the sequence alignment and with its homologue, the Streptococcus cristatus putative Lpp TptA, so this sequence warrants further verification.

Thus, applying the two pattern searches described above to the S. pyogenes genome generated a list of putative lipoproteins.