Homology modeling, or comparative modeling, refers to the construction of a three-dimensional model of a target protein based on the alignment of its sequence with that of a similar protein of known structure (the template). It is generally accepted that, within a protein family, three-dimensional structure is more conserved than primary structure and that the number of distinct folds that proteins adopt is limited. The goal of a homology modeling task is to obtain an accurate model that provides more information about the target protein than any of the homologous structures alone.
Most homology modeling procedures involve four general steps: (1) identifying suitable templates, (2) aligning the amino-acid sequence of the target with that of the chosen template, (3) building the model for the target protein, and (4) evaluating the quality of the resulting models.
The first step is a search for templates among evolutionarily related proteins with experimentally solved structures. The RCSB Protein Data Bank (PDB) can be searched using the target sequence as the query. The template can be chosen on the basis of sequence identity, the experimental quality of the solved structures (e.g. resolution of a crystallographic structure, number of restraints per residue for an NMR structure), the presence of ligands or cofactors, and so on.
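The selection criteria above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the TemplateHit record, the template identifiers, the 30% identity cutoff and the ordering of the criteria are assumptions made for illustration, not the rules applied by any particular server or database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TemplateHit:
    """Hypothetical record describing one candidate template."""
    template_id: str
    identity: float              # percent sequence identity to the target
    resolution: Optional[float]  # in angstroms; None when not applicable (e.g. NMR)
    has_ligand: bool


def rank_templates(hits: List[TemplateHit], min_identity: float = 30.0) -> List[TemplateHit]:
    """Keep hits above an (assumed) identity cutoff and rank them.

    Ranking order: higher identity first, then better (lower) resolution,
    then templates that contain a ligand or cofactor.
    """
    candidates = [h for h in hits if h.identity >= min_identity]

    def sort_key(hit: TemplateHit):
        # Structures without a resolution value sort after those with one.
        resolution = hit.resolution if hit.resolution is not None else float("inf")
        return (-hit.identity, resolution, not hit.has_ligand)

    return sorted(candidates, key=sort_key)
```

In practice the relative weight of identity, resolution and ligand content depends on the purpose of the model, so the sort key is the part most likely to need adjusting.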
This is a crucial step, as homology modeling can only generate a model as good as the chosen template, which makes the availability of homologous templates one of the major bottlenecks of homology modeling. Insertions, deletions and other motifs that can cause large structural changes are not expected to be incorporated accurately in the final homology model.
Second, the sequence identity is estimated from a pairwise alignment between the protein sequence of the target and that of the chosen template. The alignment is produced with sequence alignment packages and, where necessary, adjusted manually. The choice of template remains crucial at this step. The quality of a template increases with its sequence identity to the target and decreases with the number and length of gaps in the alignment. Generally, major differences between prediction methods are only observed for target-template pairs sharing less than 40% sequence identity. As the sequence identity decreases, the alignment becomes less accurate and contains an increasing number of gaps and alignment errors. Alignment errors are regarded as one of the main causes of poor homology models. Therefore, the sequence identity of the chosen template gives a first estimate of the quality of the model.
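As an illustration of how sequence identity and gaps can be read off a pairwise alignment, the following Python sketch counts identical residues over the aligned (gap-free) columns and counts gap runs. The function names are hypothetical, and the choice of denominator is an assumption: alignment programs differ in whether they divide by the number of aligned columns, the alignment length, or the length of the shorter sequence.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    aligned_columns = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the denominator
        aligned_columns += 1
        if a.upper() == b.upper():
            matches += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns


def count_gap_runs(aligned_sequence: str, gap: str = "-") -> int:
    """Number of contiguous gap stretches (gap openings) in one aligned sequence."""
    runs = 0
    previous_was_gap = False
    for residue in aligned_sequence:
        is_gap = residue == gap
        if is_gap and not previous_was_gap:
            runs += 1
        previous_was_gap = is_gap
    return runs
```

For example, aligning "ACD-EF" with "ACDGEY" gives five gap-free columns, four of which are identical, and one gap run in the target.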
Decreasing alignment accuracy between the target and the template results in an increase in the expected root-mean-square deviation (RMSD) of the resulting homology model. Dunbrack has reviewed the development of algorithms for sequence comparison and alignment.
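The RMSD mentioned above measures the average distance between equivalent atoms of two structures. A minimal sketch of the calculation follows; it assumes that the two coordinate sets list equivalent atoms in the same order and that the structures have already been superposed, since no fitting is performed here.

```python
import math
from typing import Sequence, Tuple

Coordinate = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coordinate], coords_b: Sequence[Coordinate]) -> float:
    """Root-mean-square deviation between two pre-superposed coordinate sets."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must contain the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom is required")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))
```

The value is usually reported in angstroms and is often computed over Cα atoms only, which makes it insensitive to side-chain placement.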
The third step is building the model for the target sequence. The method depends on the software used and on the energy function behind it. In general, most software packages take the positions of the target sequence that are aligned to the template structure and model the target either by simply copying the coordinates of the backbone atoms or by using this information to generate spatial restraints. Modeling unaligned regions requires different tactics, which differ from one package to another.
The final step is to evaluate the quality of the resulting models. After models are built, it is important to check them for possible errors. Scoring functions have been developed for estimating the overall quality of a model and for comparing predictions based on alternative alignments. Two kinds of evaluation are distinguished: internal and external. Internal evaluation of self-consistency verifies whether the model satisfies the restraints used to calculate it (e.g. Procheck and WhatCheck). External evaluation relies on information that was not used in the calculation of the model and on prediction (e.g. stereochemical errors due to alignment errors). A model should also be consistent with experimental data, such as site-directed mutagenesis, cross-linking data and ligand binding. Each homology modeling package uses a different approach to evaluation.
However, the accuracy of individual models may vary significantly from the expected average quality due to poor target-template alignments, low template quality, structural flexibility or inaccuracies introduced by the modeling program. The large number of degrees of freedom in a protein chain and the irregularity of the energy landscape produced by atomic repulsion at short distances can greatly complicate the homology modeling task.
If the resulting model is not satisfactory, some or all of the steps can be repeated to obtain a better model.
Software packages for homology modeling
Many software packages are available for homology modeling. All of these programs share two common components: (a) an energy function that evaluates how favorable a particular structure is for a particular sequence and (b) a sampling procedure to search for low-energy conformations.
This section describes the software packages used in this homology modeling task: SWISS-MODEL (Swiss Institute of Bioinformatics), Modeller (Laboratory of Andrej Sali, University of California), Rosetta ( ) and I-TASSER. Each package is described in terms of its methodology, the algorithms it uses to calculate the homology model, and its possible limitations.
SWISS-MODEL was the first automated modeling server to be made publicly available (Fiser & Sali, 2003) and can be accessed at http://swissmodel.expasy.org/. The SWISS-MODEL workspace integrates the software required to perform the four steps of homology modeling with access to various protein sequence and structure databases (Bordoli et al., 2009). It allows the user to construct models from any computer with an internet connection, without the need to download and install large software packages and databases (Arnold et al., 2006). The user can thus build and evaluate protein homology models at different levels, depending on the complexity of the task (Bordoli et al., 2009). Previous modeling tasks are also stored and can be accessed at any time.
The modeling method used by SWISS-MODEL is rigid-body assembly, which constructs the model from a few core regions and from loops and side chains obtained from related structures. The main limitation of this method is that the accuracy of the model depends on a good alignment: gaps in the target-template alignment can result in poor models.
To obtain a homology model of a target sequence using SWISS-MODEL, the user can choose among three approaches, whose applicability depends on how distantly related the target protein and the homologous template are: automated mode, alignment mode, and project mode.
The automated mode is a highly automated modeling procedure with minimal user intervention, used when highly similar template structures are available (50% or more shared identity). The template can be chosen by the user (through the corresponding PDB code), or close homologs can be identified using a BLAST search against the SWISS-MODEL Template Library. If several similar template structures are available, the automated selection favors high-resolution templates with good quality assessment.
The lack of a highly similar template prompts more complex modeling tasks and the user is allowed to control several steps to construct a model that is more accurate. This can be achieved by using the available alignment mode or the project mode.
At the end of the modeling task the results are presented in a graphical summary which can be downloaded. Each resulting model is accompanied by several quality checks.
The results of QMEAN, a composite scoring function for model quality estimation, and DFIRE, an all-atom distance-dependent statistical potential, are provided as global indicators of the quality of the resulting models. For local model quality, the SWISS-MODEL workspace provides graphical plots of the ANOLEA mean force potential, the GROMOS empirical force field energy and the neural network-based approach ProQres. To assess the conformational quality of both models and template structures (e.g. stereochemical plausibility and deviations in amino-acid geometry such as bond lengths and angles), the user can check the WhatCheck and Procheck reports.
Modeller performs homology modeling given a target-template sequence alignment and an associated template structure. Several graphical interfaces to Modeller are available (e.g. Chimera via the Multalign Viewer).
Modeller implements an automated approach to homology modeling by satisfaction of spatial restraints. This method has three steps.
First, many distance and dihedral angle restraints on the target sequence are derived from its alignment with the template structure; these constitute the first source of spatial restraints in Modeller. The form of these restraints was obtained from a statistical analysis of the relationships between many pairs of homologous structures. By scanning the database, tables quantifying various correlations were obtained, such as the correlations between two equivalent Cα-Cα distances, or between equivalent mainchain dihedral angles from two related proteins. These relationships were expressed as conditional probability density functions (pdfs) and can be used directly as spatial restraints. For example, probabilities for different values of the mainchain dihedral angles are calculated from the type of residue considered, from the mainchain conformation of an equivalent residue, and from the sequence similarity between the two proteins. An important feature of the method is that the spatial restraints are obtained empirically, from a database of protein structure alignments.
Next, the spatial restraints and CHARMM energy terms enforcing proper stereochemistry are combined into an objective function. The CHARMM terms constitute the second group of restraints used by Modeller.
Finally, the model is obtained by optimizing the objective function in Cartesian space. The optimization uses the variable target function method, employing conjugate gradients and molecular dynamics with simulated annealing, to minimize violations of the spatial restraints.
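The idea of building a model by minimizing restraint violations in Cartesian space can be illustrated with a toy example. The sketch below is not Modeller's algorithm: it uses only harmonic distance restraints (which correspond to Gaussian pdfs) and plain gradient descent in place of the variable target function method, conjugate gradients and simulated annealing. The function names, step size and restraint values are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

# (atom index i, atom index j, target distance, standard deviation)
Restraint = Tuple[int, int, float, float]


def objective(coords: Sequence[Sequence[float]], restraints: Sequence[Restraint]) -> float:
    """Sum of harmonic violations, (d_ij - d0)^2 / (2 * sigma^2), over all restraints."""
    total = 0.0
    for i, j, d0, sigma in restraints:
        d = math.dist(coords[i], coords[j])
        total += (d - d0) ** 2 / (2.0 * sigma ** 2)
    return total


def gradient(coords: Sequence[Sequence[float]], restraints: Sequence[Restraint]) -> List[List[float]]:
    """Analytical gradient of the objective with respect to every coordinate."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j, d0, sigma in restraints:
        d = math.dist(coords[i], coords[j])
        if d == 0.0:
            continue  # direction undefined for coincident atoms; skip this term
        scale = (d - d0) / (sigma ** 2 * d)
        for k in range(3):
            diff = coords[i][k] - coords[j][k]
            grad[i][k] += scale * diff
            grad[j][k] -= scale * diff
    return grad


def minimize(coords: Sequence[Sequence[float]], restraints: Sequence[Restraint],
             step: float = 0.05, n_steps: int = 2000) -> List[List[float]]:
    """Plain gradient descent on the restraint objective (toy optimizer)."""
    current = [list(c) for c in coords]
    for _ in range(n_steps):
        grad = gradient(current, restraints)
        for atom, atom_grad in zip(current, grad):
            for k in range(3):
                atom[k] -= step * atom_grad[k]
    return current
```

Starting from three atoms at arbitrary positions and three mutually compatible distance restraints, the optimizer moves the atoms until the restraints are satisfied. Real restraint sets are far larger and partly conflicting, which is why more robust optimization schemes are needed in practice.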
The output is a three-dimensional structure that satisfies these restraints as well as possible. Modeller has the advantage of producing quite accurate models even from a poor target-template alignment (e.g. one with some gaps), since this only adds a few additional spatial restraints to the final optimization process.
This homology modeling package can be run in a fully automated mode, using default values and with little user intervention, which automates many of the steps required for the modeling task. However, the user can customize the run for more complex tasks and control several options, such as including water molecules, HETATM atoms and hydrogen atoms. The more variables are included, the longer the building phase takes.
At the end of the modeling task the models are primarily ranked according to their GA341 and zDOPE values. Using TSVMod from the Sali lab's Model Evaluation Server, Modeller calculates the estimated RMSD for each model, which allows the overall accuracy of the resulting model to be predicted.
The Rosetta program was originally developed for de novo structure prediction. It is now a unified software package that allows the user to perform many different tasks, including homology modeling. Like other homology modeling software, this application builds structural models of proteins using a known structure as the template.
Rosetta's homology modeling approach is based on three tasks: (1) build an incomplete model structure based on the alignment between the target sequence and a chosen template, (2) complete the missing structure using loop modeling and (3) rank or evaluate the energy of the resulting structural models.
First, an incomplete model is built from the template structure by copying coordinates over the aligned region, as in most homology modeling software.
Second, the missing structure is rebuilt using loop modeling. This step uses a library of fragments that represents the range of structures for all short segments of the protein chain. The necessary nine- and three-residue fragments can be obtained with Robetta (at http://robetta.bakerlab.org/), a complementary tool offered by the same team, or generated by the user with a built-in feature of Rosetta. The Rosetta algorithm attempts to mimic the interplay of local and global interactions in determining protein structure. Using a Monte Carlo optimization procedure, Rosetta assembles the fragments into compact structures that maximize hydrophobic burial and satisfy the hydrogen-bonding potential of β-strands. Information on sampling strategies for backbone degrees of freedom and sampling strategies for side-chain degrees of freedom.
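The Monte Carlo fragment assembly described above can be sketched in a few lines of Python. This is a toy illustration and not Rosetta code: the conformation is reduced to a list of (phi, psi) backbone torsions, a single shared fragment list stands in for position-specific fragment libraries, the scoring function is supplied by the caller, and the Metropolis acceptance criterion with a fixed temperature is an assumption about the simplest form such a search can take.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Torsions = Tuple[float, float]  # (phi, psi) in degrees


def metropolis_accept(delta_energy: float, temperature: float, rng: random.Random) -> bool:
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if delta_energy <= 0.0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)


def fragment_monte_carlo(
    n_residues: int,
    fragments: Sequence[Sequence[Torsions]],
    score: Callable[[List[Torsions]], float],
    n_steps: int = 1000,
    temperature: float = 1.0,
    seed: int = 0,
) -> Tuple[List[Torsions], float]:
    """Assemble fragments by random insertion, keeping the lowest-scoring conformation."""
    rng = random.Random(seed)
    # Start from an extended chain (assumed starting torsions).
    conformation: List[Torsions] = [(-150.0, 150.0)] * n_residues
    current_score = score(conformation)
    best, best_score = list(conformation), current_score
    for _ in range(n_steps):
        fragment = rng.choice(fragments)
        start = rng.randrange(n_residues - len(fragment) + 1)
        # Replace the torsions of a window with those of the chosen fragment.
        trial = conformation[:start] + list(fragment) + conformation[start + len(fragment):]
        trial_score = score(trial)
        if metropolis_accept(trial_score - current_score, temperature, rng):
            conformation, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(conformation), current_score
    return best, best_score
```

The score callable is where an energy function would be plugged in; any function that maps a list of torsions to a number, with lower values being better, can be used to experiment with the search.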
Finally, the fitness of each conformation is evaluated with a scoring function. Rosetta's scoring function contains knowledge-based potentials that govern strand-strand and strand-helix interactions. These potentials specify the optimal interaction distances and preferred orientations (e.g. right-hand twist of β-sheets, lack of clashes and hydrophobic packing). The energy function is (mostly) knowledge-based, which assumes that most molecular properties can be derived from known structures (e.g. the PDB). Simulations with Rosetta can use a knowledge-based centroid energy function in the low-resolution mode or a knowledge-based all-atom energy function in the high-resolution mode.
Full-atom refinement with full-atom energy function
Selection of models using clustering
Although Rosetta's homology modeling protocol has produced several accurate and precise models, a precise result is not necessarily accurate, owing to imperfections in the knowledge-based energy function or to practical limits on the sampling method.