De novo structural modeling and computational sequence analysis

Bacteriocins produced by different groups of bacteria are ribosomally synthesized peptides or proteins with antimicrobial activity that mediate specific antagonistic interactions between bacteria. Rhizobium leguminosarum is a Gram-negative soil bacterium which plays an important role in nitrogen fixation in leguminous plants. Bacteriocins produced by different strains of R. leguminosarum are known to exert antagonistic effects on other closely related strains. Recently, a bacteriocin gene was isolated from R. leguminosarum bv. viciae strain LC-31. Our study aims at computational proteomic analysis and 3D structural modeling of the novel bacteriocin protein encoded by this gene. Different bioinformatics tools and machine learning techniques were used for protein structural classification. De novo protein modeling was performed using the I-TASSER server. The final model was assessed by PROCHECK and DFIRE2, which indicated that it is reliable. Until complete biochemical and structural data of the bacteriocin protein produced by R. leguminosarum bv. viciae strain LC-31 are determined by experimental means, this model can serve as a valuable reference for characterizing this multifunctional protein.

Keywords: Bacteriocin; Rhizobium; Protein modeling; Nodulation; Symbiosis; Nitrogen fixation


Bacteriocins are proteinaceous toxins secreted by Gram-positive and Gram-negative bacteria. They typically have a narrow inhibitory spectrum, acting against bacteria that are closely related to the producing bacterium. However, many of the bacteriocins produced by lactic acid bacteria (LAB) have inhibitory spectra spanning beyond the genus level and can potentially defend against unwanted microflora [13, 22, 28]. Bacteriocins were first identified almost 100 years ago as a heat-labile product that was present in cultures of E. coli V and was toxic to E. coli S. These products were named colicins to identify the producing species [9]. Since then, bacteriocins have been found in all major lineages of bacteria and, more recently, have been described as universally produced by some members of the Archaea [23-24]. Bacteriocins are usually ribosomally synthesized. The genes encoding bacteriocin production and immunity are typically organized in epichromosomal operon clusters, but some are chromosomally encoded, as in Lactobacillus sakei 5, which produces two chromosomally encoded bacteriocins [2, 19, 26]. These polypeptides have attracted much attention due to their potential use as antibacterial agents for the treatment of infections, as well as for the preservation of food and animal feed. The bacteriocin family includes proteins that are diverse in size, microbial target, mode of action, release, and immunity mechanisms, and can be divided into two main groups: those produced by Gram-negative and those produced by Gram-positive bacteria [8].

The symbiosis between legumes and N2-fixing bacteria (rhizobia) is of huge agronomic benefit, allowing many crops to be grown without nitrogenous fertilizers. It is a sophisticated example of coupled development between bacteria and higher plants, culminating in the organogenesis of root nodules [35].

R. leguminosarum is a Gram-negative bacterium living in symbiosis with leguminous plants, in which it induces nitrogen-fixing root nodules [29]. Strains of this species have been shown to produce bacteriocins that have been characterized as small, medium or large based on their assumed sizes and diffusion characteristics. Large bacteriocins have been shown to resemble defective bacteriophages [16, 20]. Small bacteriocins were found to be chloroform soluble and heat labile and to have molecular masses of less than 2,000 daltons [31]. Small bacteriocins were shown to be acylated homoserine lactone compounds related to quorum-sensing molecules [11, 27]. Very little is known about medium bacteriocins produced by R. leguminosarum. Soil bacteria are able to produce bacteriocins, defined as specific, nonself-propagating inhibitory agents causing antagonism between closely related strains, and bacteriocinogenic activity has been described in almost all rhizobial species [30]. As bacteriocins act as a pivotal substance in specific antagonistic bacterial interactions, they can potentially be used to control bacterial plant diseases by exerting their lethal effects on bacteria of the same or related groups. Thus bacteriocins have most of the properties considered desirable for microbial control [10, 23]. It was later found that rhizobial species are not only involved in symbiotic nitrogen fixation but also exploit a range of mechanisms, directly or indirectly, to compete in nodulation and to stimulate plant growth [12]. Despite the antibacterial activity of bacteriocins, the exact mechanism of their action remains poorly understood. However, protein models of a bacteriocin can be created to gain deeper insight into its structure and function. In recent years, protein modeling has become a promising tool for predicting the structures of proteins that are difficult to solve experimentally.

The aim of this study was to perform computational sequence analysis and 3D structural modeling of a bacteriocin protein produced by R. leguminosarum bv. viciae strain LC-31. Understanding the bacteriocin 3D structure could help us to understand how these extracellular proteins may contribute to nodulation, inhibition or suppression of other pathogenic plant bacteria and related processes that are known to be influenced by R. leguminosarum strains.

Material and Methods

Sequence Data

Recently, the novel bacteriocin gene of R. leguminosarum bv. viciae strain LC-31 was isolated and characterized [18]. That work showed that the bacteriocin gene has three components: RzcA, RzcB and RzcD. While RzcB and RzcD are required for bacteriocin secretion, RzcA was found to encode the bacteriocin protein itself. By using recombination and cloning techniques, the nucleotide sequence of the RzcA fragment from R. leguminosarum bv. viciae strain LC-31 was determined to be 5'-TACGAAACTCTGGACGGCTCACCAATGCCGAAGCATCTCGTTGCCGACGCATCACTTATTTATCGGCCCACCAATGCCACAT-3'.

In this study, we used this nucleotide sequence as a query for homology searching and computational modeling of the bacteriocin protein from R. leguminosarum bv. viciae strain LC-31.

Protein sequence and structure analysis

Nucleotide sequence translation

To predict the structural properties and the 3D structure of any protein, we first require its amino acid sequence. To date, the protein sequence of this specific bacteriocin gene has not been deposited in any database; therefore, we used Translate [33] from Expasy to translate the query nucleotide sequence into its protein sequence.
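The translation step can be illustrated with a short sketch. It is not the Expasy Translate implementation; it assumes the standard genetic code and uses hypothetical helper names (`translate`, `six_frames`), with the RzcA fragment taken from the Sequence Data section.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard table is appropriate for this fragment).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# RzcA fragment as given in the Sequence Data section (82 nucleotides).
RZCA = (
    "TACGAAACTCTGGACGGCTCACCAATGCCGAAGCATCTCGTTGCCGA"
    "CGCATCACTTATTTATCGGCCCACCAATGCCACAT"
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, offset=0):
    """Translate complete codons starting at `offset`; '*' marks a stop."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frames(seq):
    """Return the three forward (+) and three reverse (-) translations."""
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(seq, k)
        frames[f"-{k + 1}"] = translate(rc, k)
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frames(RZCA).items()):
        print(name, len(protein), protein)
```

Under these assumptions the three forward frames contain no stop codon, whereas each reverse frame contains at least one.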

Primary and secondary structures

ProtParam [33] was used to predict physicochemical properties of the translated protein sequence. The parameters computed by ProtParam include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY). Information regarding the secondary structure of proteins supports fold recognition, ab initio structure prediction, classification of structural motifs, and refinement of sequence alignments. Secondary structure predictions (helix, sheets, and coils) were made by using different types of neural networks. In comparison to other prediction methods, machine learning approaches such as neural networks have a major advantage, as these methods use training sets of solved structures to identify common sequence motifs associated with particular arrangements of secondary structures. The hierarchical neural network (HNN) secondary structure prediction method used in this study is based on artificial neural networks [4]. Two networks are implemented in this program: the sequence to structure network and the structure to sequence network. JPred3 [3] is another secondary structure prediction server that uses a double neural network approach. The recently updated Jnet algorithm provides a three-state (α-helix, β-strand and coil) prediction of secondary structure at an accuracy of 81.5%. Another server used for secondary structure predictions is PSIPRED [17]. It incorporates two feed-forward neural networks which perform an analysis on output obtained from Position Specific Iterated - BLAST (PSI-BLAST). Using a very stringent cross validation method to evaluate the method's performance, PSIPRED achieves an average accuracy of 80.7%.
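Two of the ProtParam parameters listed above, molecular weight and GRAVY, are simple sums over the residues and can be sketched as follows. This is not ProtParam's code: the function names are hypothetical, the residue masses are the commonly tabulated average masses, the hydropathy values are the Kyte-Doolittle scale, and the three sequences are our own standard-code translations of the forward frames of the RzcA fragment.

```python
# Average residue masses in daltons (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini

# Kyte-Doolittle hydropathy scale, used for GRAVY.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumed translations of the three forward reading frames (standard code).
FRAMES = {
    "frame 1": "YETLDGSPMPKHLVADASLIYRPTNAT",
    "frame 2": "TKLWTAHQCRSISLPTHHLFIGPPMPH",
    "frame 3": "RNSGRLTNAEASRCRRITYLSAHQCH",
}


def molecular_weight(seq):
    """Average molecular weight in daltons of an unmodified peptide."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq) + WATER


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    for name, seq in FRAMES.items():
        print(f"{name}: {len(seq)} aa, "
              f"{molecular_weight(seq) / 1000:.2f} kDa, GRAVY {gravy(seq):.2f}")
```

A negative GRAVY value indicates a predominantly hydrophilic sequence, a positive one a predominantly hydrophobic sequence.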

Subcellular localization prediction

Determining subcellular localization is important for understanding protein function and is a critical step in genome annotation. PSORTb v3.0.2 [36], used here, is reported to be the most precise bacterial localization prediction tool. It can make localization predictions for Gram-positive and Gram-negative bacterial sequences as well as archaeal sequences.

3D structural modeling and assessment

The 3D structure is the final shape that a functional protein assumes. Various bonding interactions between the side chains of the amino acid residues determine the tertiary structure of the protein. These interactions include salt bridges, disulfide bonds, hydrophobic interactions and hydrogen bonds. No high resolution X-ray or NMR structure is available for the bacteriocin produced by R. leguminosarum bv. viciae strain LC-31. Therefore, we attempted to model the 3D structure using two approaches: homology modeling and de novo structural modeling. Homology modeling works best when the query matches an existing high resolution structure in the database with more than 60% sequence similarity. In cases where no good template is available, threading is used to predict the 3D structure of the target protein. For homology modeling, we used an academic version of MODELLER v 9.2 [5]. For de novo structural modeling, I-TASSER [25] was used. Furthermore, predicted 3D structures were evaluated by PROCHECK [15] and DFIRE2 [34], and potential disulfide bond formation was checked by DiANNA [6] and DISULFIND [1]. Structure visualization was performed with UCSF Chimera 1.5 [21].

Results and Discussion

Sequence translation and homology searching

The nucleotide sequence of R. leguminosarum bv. viciae strain LC-31 RzcA was obtained [18] and then subjected to nucleotide sequence translation tools for determination of the bacteriocin protein sequence. A total of six reading frames were generated. Stop codons were observed in all three 3'-5' reading frames (data not shown), and these frames were discarded. The remaining 5'-3' frames, which are given in table 1, were then subjected to blastp analysis for the purpose of similarity searching, determining the level of conservation among other bacteriocin proteins and identifying possible templates for 3D structure prediction by homology modeling. The search was performed against all non-redundant GenBank CDS translations, PDB, SwissProt, PIR, and PRF databases using default parameters. A total of 100 targets were obtained. However, the overall percentage of sequence homology was not satisfactory (data not shown). This reflects the level of diversity that bacteriocin proteins show among different bacterial species and strains.

Primary and secondary structure analysis

ProtParam was used to analyze different properties of the translated reading frames. Frames 1 and 2 were found to be composed of 27 amino acids, whereas frame 3 had 26 amino acids. The molecular weights of frames 1, 2 and 3 were calculated to be 2.96 kDa, 3.11 kDa, and 3 kDa respectively. Detailed physicochemical results for the translated frames are given in table 2. The molecular weight and small protein length of the bacteriocin produced by R. leguminosarum bv. viciae strain LC-31 suggest that it is biologically active and may therefore possess a wide range of antimicrobial activity.

Different machine learning and neural network based approaches were used to analyze the secondary structures and predict the presence of alpha helices, coils, and extended strands for each frame. Prediction results from different tools are summarized in table 3. Overall, little variation was observed in the results from different prediction tools and servers. Combining the results from each approach, it was observed that reading frame 1 can form two types of secondary structures: alpha helices and beta sheets. Reading frame 2 is predicted to have only beta sheets, whereas reading frame 3 can also form both alpha helices and beta sheets. However, frame 3 was predicted to have more secondary structure elements than frame 1.
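One simple way to combine per-residue predictions from several servers is a majority vote, sketched below. This is an illustrative assumption, not necessarily the procedure used to build table 3; the function name and the three-state strings (H = helix, E = strand, C = coil) are hypothetical toy inputs, not the predictions obtained in this study.

```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote over equal-length three-state strings.

    Positions without a strict majority fall back to coil ('C').
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        result.append(state if votes > len(predictions) / 2 else "C")
    return "".join(result)


def element_types(states):
    """Return the secondary structure types present, ignoring coil."""
    names = {"H": "alpha helix", "E": "beta strand"}
    return sorted(names[s] for s in set(states) if s in names)


if __name__ == "__main__":
    # Toy predictions standing in for the output of three servers.
    toy = ["HHHCCEEE", "HHCCCEEE", "HHHCCCEE"]
    combined = consensus(toy)
    print(combined, element_types(combined))
```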

Subcellular localization predictions

Subcellular localization is a key functional attribute of a protein. Since cellular functions are often localized in specific compartments, predicting the subcellular localization of unknown proteins may be used to obtain useful information about their functions and to select proteins for further study. Moreover, studying the subcellular localization of proteins is also helpful in understanding disease mechanisms and for developing novel drugs [32]. All bacterial proteins are synthesized in the cytoplasm, and most remain there to carry out their unique functions. Other proteins, however, contain export signals that direct them to other cellular locations. In Gram-positive bacteria, these include the cytoplasmic membrane, cell wall and extracellular space, and in Gram-negative bacteria, they include the cytoplasmic membrane, the periplasm, the outer membrane and the extracellular space. In most cases, the whole protein is located in a single compartment; however, proteins can also span multiple localization sites [7]. Bacterial cell surface and secreted proteins are of interest for their potential as vaccine candidates or as diagnostic targets. It is also known that bacteriocins are proteins secreted by bacteria to kill other closely related bacterial species. We analyzed all three 5'-3' reading frames for their localization potential with PSORTb. Based on the prediction results, the localization of reading frame 1 was reported as unknown, whereas reading frames 2 and 3 were predicted to be extracellular proteins.

Tertiary structure prediction, evaluation and assessment

Protein 3D structures can provide us with precise information on how proteins interact and localize in their stable conformation. Homology or comparative modeling is one of the most common protein structure prediction methods in structural genomics and proteomics. Therefore, we first tried to model the bacteriocin 3D structure using homology modeling. Numerous online servers and tools are available for homology modeling or comparative modeling of proteins. Despite minor differences, one initial step common to all modeling tools and servers is to find the best matching template. This was done by performing a sequence homology search with BLASTP. Templates are experimentally determined 3D structures of other proteins which share a certain level of sequence similarity with the query sequence. In the next step, the template sequence and the protein sequence whose structure is to be determined are aligned using ClustalW2 [14]. A well-defined alignment is very important for the prediction of a reliable 3D structure. Swissmodel and Geno3D are two different servers that were used to model the 3D structure of the bacteriocin. However, neither of these servers was able to model the structure for any of the three reading frames, because of the absence of a suitable template. We were also unable to model the 3D structure with MODELLER for the same reason. These findings are consistent with the blast homology search results mentioned above, where the query did not share more than 30% identity with any other protein in the protein databases at NCBI, PDB and Uniprot. Due to the template-dependent limitations of homology modeling, another computational biology approach, known as de novo protein structure prediction, was undertaken. Ab initio or de novo protein modeling works on the principle that all the information for a protein structure lies in its amino acid sequence. This method builds a 3D structure based on physical principles rather than on previously solved structures.
Several online servers, grid services and offline standalone software applications have been developed for de novo protein modeling. Amongst them, I-TASSER is one of the most widely used online servers for protein structure and function predictions. It works by using a combination of ab initio folding and threading methods. In this study, I-TASSER was used for the prediction of the bacteriocin 3D structure. Each reading frame was separately modeled in I-TASSER and five models were generated for each frame. Models generated for frames 1, 2 and 3 are shown in figures 1, 2 and 3 respectively.
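The identity cut-off used in the template search can be made concrete with a small sketch. The definition of percent identity varies between tools; here it is assumed to be the number of identical residue pairs divided by the number of aligned columns in which neither sequence has a gap. The function names and the toy alignments are hypothetical and are not BLASTP or ClustalW2 output from this study.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue  # gap columns are not counted
        compared += 1
        if q == t:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def usable_template(identity, threshold=30.0):
    """True when the identity exceeds the 30% level mentioned in the text."""
    return identity > threshold


if __name__ == "__main__":
    # Toy alignments, for illustration only.
    for query, template in [("RNSG-RLTNA", "RKSGARLSNA"),
                            ("ACDEFGHIKL", "AWYEVNQRST")]:
        identity = percent_identity(query, template)
        print(f"{identity:.1f}% identity, usable template: "
              f"{usable_template(identity)}")
```

When no candidate exceeds the threshold, as was the case for all three reading frames, template-based modeling is not applicable and a de novo method is the remaining option.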

Once the models were generated, they were subjected to structural assessment and validation using PROCHECK, DFIRE2 and the C-Score values from I-TASSER. Ramachandran plots were generated by PROCHECK. Additionally, the stereochemical qualities were assessed for each predicted model. Assessment results from PROCHECK are summarized in table 4. A total of 15 structural models from three reading frames were analyzed in DFIRE2 and protein conformation free energy scores were calculated. Free energy calculations made by DFIRE2 are provided in table 4.

The final assessment and validation of the protein structures were based on the combined results from PROCHECK, DFIRE2 and I-TASSER's C-Score. In the case of frame 1, models 2 and 3 contained no residues in the disallowed region, one residue in the generously allowed region, and more than 57% of residues in the most favored regions. By using DFIRE2, predicted energy values for models 2 and 3 were found to be -23.55 and -22.48 respectively, which are comparable to the energy values of models 1, 4 and 5. For frame 2, poor Ramachandran plots were obtained. Among the models generated for reading frame 3, models 1 and 3 had no residues in the disallowed region and one residue in the generously allowed region. Model 5 also had no residues in the disallowed region and only one residue in the generously allowed region. The energy values for models 1 and 3 were calculated to be -30.64 and -27.64 respectively, which were the lowest among the five models. In addition, the C-Score values for model 1 (-1.86) and model 3 (-1.91) were found to be the highest among the five models.

The presence of two or more cysteine residues can result in the formation of disulfide bonds, which are known to play an important role in bacteriocin protein stabilization. Two cysteine residues were found in translated frame 3, one cysteine residue in frame 2 and no cysteine residue in frame 1. Therefore, reading frame 3 was inspected for potential disulfide bonding. Two servers were used for the prediction of disulfide bonding state and connectivity: DiANNA [6] and DISULFIND [1]. DiANNA employs a novel diresidue neural network based approach. In the initial stage, PSIPRED is run to predict the protein's secondary structure. PSIBLAST is then run against the non-redundant SwissProt database to obtain a multiple alignment of the input sequence. Next, the cysteine oxidation state is predicted, and then each pair of cysteines in the protein sequence is assigned a likelihood of forming a disulfide bond. Finally, Rothberg's implementation of Gabow's maximum weighted matching algorithm is applied to the diresidue neural network scores in order to produce the final connectivity prediction. DISULFIND, on the other hand, employs a support vector machine (SVM) binary classifier to predict the bonding state of each cysteine, followed by a refinement stage that classifies all the cysteines in a chain in a collective fashion. Nearly identical results were obtained from both disulfide bonding and connectivity prediction servers. Two cysteine residues were found at positions 14 and 25 in reading frame 3, separated by a distance of 11 amino acids. The presence of disulfide bond forming cysteine residues is a characteristic feature of bacteriocins. It can also be used as a basis for sub-grouping. It has been observed that the antibacterial efficiency of a bacteriocin increases with an increase in the number of disulfide bonds. For example, pediocin AcH with two disulfide bridges has a wider range of antimicrobial activity as compared to lactococcin B which has a single disulfide bridge (Ralph et al., 1995). Also, disulfide bonds are known to be important for the stability of the bacteriocin protein (Olivera et al., 2003; Rober, 2005). In agreement with the above mentioned structure assessment analysis, frame 3 contains two cysteine residues with a high predicted potential for bond formation and may be a potential bacteriocin protein sequence.
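The cysteine counts and positions reported above can be checked directly on the translated frames with a short sketch. This only enumerates candidate cysteine pairs; it does not reproduce the neural network or SVM scoring of DiANNA or DISULFIND. The function names are hypothetical, and the sequences are our own standard-code translations of the three forward frames.

```python
from itertools import combinations

# Assumed translations of the three forward reading frames (standard code).
FRAMES = {
    "frame 1": "YETLDGSPMPKHLVADASLIYRPTNAT",
    "frame 2": "TKLWTAHQCRSISLPTHHLFIGPPMPH",
    "frame 3": "RNSGRLTNAEASRCRRITYLSAHQCH",
}


def cysteine_positions(seq):
    """1-based positions of cysteine residues in a protein sequence."""
    return [i for i, aa in enumerate(seq, start=1) if aa == "C"]


def candidate_pairs(seq):
    """All cysteine pairs that could, in principle, form a disulfide bond."""
    return list(combinations(cysteine_positions(seq), 2))


if __name__ == "__main__":
    for name, seq in FRAMES.items():
        pairs = candidate_pairs(seq)
        separations = [b - a for a, b in pairs]
        print(name, cysteine_positions(seq), pairs, separations)
```

Only a frame with at least two cysteines yields a candidate pair, which is why frame 3 alone was submitted for connectivity prediction.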

Based upon current knowledge regarding the activity and functionality of bacteriocins, and on the computational assessment results, models were selected as representatives of the bacteriocin 3D structure only if they met the following criteria: (i) predicted to be an extracellular protein with the maximum number of secondary structures in comparison to other predicted models; (ii) presence of cysteine residues for disulfide bonding; (iii) Ramachandran plots showing the maximum number of residues in allowed regions and the least number of residues in disallowed regions; and (iv) minimum free energy score of protein conformation and highest C-Score value. Therefore, we concluded that reading frame 1 is not likely to be the protein of the given bacteriocin, as it is not predicted to be an extracellular protein by PSORTb, has fewer predicted secondary structure elements than reading frame 3, and contains no cysteine residues. Reading frame 2 is least likely to be the protein of the given bacteriocin, as it is predicted to have less secondary structure, which is required for bacteriocin function.

We propose that reading frame 3 is the desired protein sequence of the bacteriocin in question and that models 1 and 3 are the most probable 3D structures of the given bacteriocin. PSORTb predicted frame 3 to be an extracellular protein, and it had the maximum number of secondary structures compared to frames 1 and 2. The presence of cysteine residues and their potential for disulfide bonding were supported by DiANNA and DISULFIND. PROCHECK, DFIRE2 and C-Score assessments identified the best tertiary structures for frame 3. Although the bonding distance between the cysteine residues was found to be more than the allowed distance (data not shown), further structure refinement of models 1 and 3 may decrease the distance between the two cysteine residues.
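The frame-level part of this selection, criteria (i) and (ii), can be written as a simple filter. The sketch below is illustrative only: the class and function names are hypothetical, the values are the PSORTb localizations and cysteine counts reported in the text, and criteria (iii) and (iv) are omitted because they depend on the per-model values in table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameSummary:
    name: str
    localization: str  # PSORTb prediction as reported in the text
    cysteines: int     # number of cysteine residues in the translated frame


FRAME_SUMMARIES = [
    FrameSummary("frame 1", "Unknown", 0),
    FrameSummary("frame 2", "Extracellular", 1),
    FrameSummary("frame 3", "Extracellular", 2),
]


def failed_criteria(frame):
    """List the frame-level criteria that a reading frame does not meet."""
    failures = []
    if frame.localization != "Extracellular":
        failures.append("not predicted to be extracellular")
    if frame.cysteines < 2:
        failures.append("fewer than two cysteines for a disulfide bond")
    return failures


def select_frames(frames):
    """Names of the frames that pass every frame-level criterion."""
    return [f.name for f in frames if not failed_criteria(f)]


if __name__ == "__main__":
    for frame in FRAME_SUMMARIES:
        print(frame.name, failed_criteria(frame) or "passes")
    print("selected:", select_frames(FRAME_SUMMARIES))
```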

With the assistance of a well-defined structure of bacteriocin, one can predict its functional and binding sites, which can help in understanding the multi-functional role of bacteriocin for competition in nodulation. This knowledge can be further used in drug design to enhance or suppress the production of bacteriocin as required.


We are thankful to the Director of the National Centre of Excellence in Molecular Biology (CEMB) for providing facilities, and to our colleagues in the Virology laboratory, the Functional Genomics laboratory, and the Bioinformatics laboratory (CEMB) for their encouragement and support in carrying out this work.

Conflicts of Interest

The authors declare that they have no conflict of interest.