This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
A protein is a polypeptide chain made up from different amino acids linked together in a definite sequence. Proteins, commonly, contain twenty amino acids. Each amino acid has a similar, yet unique structure. Different proteins have different amino acids, while the amino acids sequence is known as the primary structure of the protein. amino acid sequence can be referred to in two ways: the three letters code and the one letter code as shown in Table 1.1. To illustrate, we can refer to a small peptide which contains 8 risidues using the three-letter code as: AspIleGluPheArgValLeuHis or as: DIEFRVLH using the one-letter code. Proteins are not linear molecules of amino acid sequence like DIEFRVLH for example. Rather, this sequence folds into a complex three-dimensional structure which is unique to each protein. This three-dimensional structure allows proteins to function. Thus in order to understand the protein function, we must understand protein structure. (Hill R. B. 2000) Amino acids are classified by the chemical nature of their side chains. The famous classification of the amino acids divides them into two groups; the polar (or hydrophilic) amino acids and the non-polar (or hydrophobic) amino acids. The polar amino acids have side chains that interact with water, while those of the non-polar amino acids do not.
Most of the amino acids (except for proline) have a carboxyl group and an amino group as shown in Figure 1.1. Each amino acid has a different side chain (or R group). The side chains vary extremely in their complexity and properties. (Akete Lex Adjei 1997) For example the side chain of glycine is simply hydrogen. The non-polar amino acids include: alanine, cysteine, glycine, isoleucine, leucine, methionine, phenylalanine, proline, tryptophan, tyrosine and valine.
The polar amino acids include: arginine, asparagine, aspartate, glutamine, glutamate, histidine, lysine, serine, and threonine.
Figure1.1 Amino acid structure
Levels of Protein Structures
Protein structure can be described in four hierarchical levels of complexity (Golan 2008) as illustrated by Figure 1.2:
Primary structure: This level refers to the "linear" sequence of amino acids. The sequence of amino acids in each protein is determined by the gene that encodes it. The gene is transcribed into a messenger RNA (mRNA) and the mRNA is translated into a protein by the ribosome.
Secondary structure: it is "local" ordered structure brought about via hydrogen bonding mainly within the peptide backbone. The most common secondary structure elements in proteins are the alpha (Î±) helix and the beta (Î²) sheet.
Tertiary structure: this sturcture refers to the "global" folding of a single polypeptide chain. A major driving force in dertemining the tertiary structure of globular proteins is the hydrophobic effect. The polypeptide chain folds allowing the side chains of the nonpolar amino acids to hide within the structure and the side chains of the polar residues to get exposed to the outer surface. Hydrogen bonding ,including groups from both the peptide backbone and the side chains, are important in stabilizing tertiary structure.
Quaternary structure: this structure involves uniting two or more polypeptide chains to form a multi-subunit structure. Quaternary structure is the stable association of multiple polypeptide chains resulting in an active unit. Not all proteins exhibit quaternary structure. Usually, each polypeptide within a multisubunit protein folds more-or-less independently into a stable tertiary structure and the folded subunits then unite together to form the final structure.
Figure 1.2: The four different levels of protein structure
Protein structure Prediction
Predicting the three-dimensional structure of a protein from its linear sequence is a great challenge in the current computational biology. The problem can be described as the prediction of the three-dimensional structure of a protein from its amino acid sequence or the prediction of a protein's tertiary structure from its primary structure.
There are two methods for protein structure prediction: the experimental methods and the computational methods.
1.3.1 Experimental Methods
In the meantime, there are two main experimental methods available for protein structure prediction: X-ray crystallography and nuclear magnetic resonance (NMR). Unfortunately, these methods are not efficient enough because they are expensive and time-consuming. As a result, there is a bad need for a fast and reliable computational method to predict structures from protein sequences, especially because the number of completely-sequenced genomes is growing very fast.
1.3.2 Computational Methods
There are three main computational methods for protein structure prediction which depends mainly on the percentage of similarity of the input protein sequence with other existing sequences in the database.
Homology modeling, also known as comparative modeling, is a prediction method which is used when there is a similarity between the target sequence and the sequences of already exist proteins in protein database. It is based on the observation that the proteins with similar sequence have similar tertiary structure. (Chothia C 1986) As a result, the tertiary structure of a protein can be built using the templates of the known structures that share the target sequence with significant similarity. In comparative modeling the prediction is based on the knowledge of the structure of the existing known proteins, so that the sequence of the unknown protein is aligned to an existing known protein and if a similarity is more than 30% then the three dimensional structure is assumed to be the same (L. McGuffin 2008).
Fold recognition, also known as protein threading, is an inverse of protein folding problem. It based on the fact that the number of new folded protein structure is not growing fast comparing to the number of new protein sequences, which leads to the observation that any new predicted structure will be almost folded to an existing structure in the database. In this method the target sequence is aligned to the existing structure templates in the library to find the best structure that matches this sequence. Although fold recognition technique will not yield equivalent results as those obtained from experimental methods, it is a comparatively fast and inexpensive way to a build a close approximation of a structure from a sequence. The main concept of fold recognition methods is to turn the problem of homology modeling upside down. In other words, the procedure will be to calculate how well each potential structure would fit a sequence, rather than how well each sequence fits a structure. Both fold recognition methods and homology modeling are template-based methods, but homology modeling is for those targets that have homologous proteins with known structure while fold recognition is for the targets with only fold-level homology (L. McGuffin 2008).
Ab initio is a prediction method that seeks to predict the tertiary structure of a protein from its amino acid sequence alone -without knowledge of similar folds. It has been called by several names like de novo modeling, free modeling or physics-based modeling (W. S. Lee J 2009). It based on the thermodynamic hypothesis (Anfinsen 1973) which states that the tertiary structure of the protein is the conformation with the lowest free energy. Ab initio modeling, however, is challenging for the following reasons. First, there is a huge number of proteins that have no homology with any of the known structure proteins. Second, some proteins which show high homology with other proteins have different structures. Third, comparative modeling does not offer any perception of why a protein adopts a specific structure (Helles 2008).
The knowledge of the 3D structure of proteins is essential for understanding their biological functions.
The difficulty of determining the three dimensional structure of proteins has led to an increasing gap between the huge number of protein sequences and the limited number of protein structures. Figure 1.
The number of available protein structures in the PDB database is 2 to 3 orders of magnitude smaller than that of the available protein sequences.(on Tuesday Jan 19, 2010 at 4 PM PST there are 62,787 Structures and 10,158,056 sequences according to rcsb PDB and UniPort databases respectively ).
Thus an affordable approach and a high throughput method are urgently needed in order to understand the biological systems and to reduce the gap between protein sequence and protein structures.
Figure 1.3: The growth of the protein sequences
Research Goals and Objectives
The main objective of this thesis is to propose a new algorithm for protein tertiary structure prediction using the Harmony Search Algorithm as a searching tool. This thesis also proposes a parallelized platform of the HSA to enhance the speed of protein structure prediction. Therefore, the objectives of this thesis are:
Enhancing protein tertiary structure prediction using Harmony Search Algorithm.
Enhancing the speed of protein structure prediction using parallel technique.
Contribution to the thesis
The following points summarize my contribution to the thesis:
The main contribution is adapting Harmony Search Algorithm to the protein structure prediction which will be the first use of this algorithm in this problem. This requires the adaptation of the algorithm to be suitable to this new problem; I thereby propose three contributions in the parameters of HS in my implementation :
Proposing a new method to select the PAR parameter to be changed through the improvisation process, the simulated annealing for that is used as follows:
Tn = 1000000
PAR = PAR * Exp [- abs (best energy) / Tn]
Tn =Tn-Î±Tn // .005 <=Î±<.5
b) Proposing a new method to select the HMCR parameter to be changed through the improvisation process, the simulated annealing for that is used as follows:
Tn = 1000000
HMCR = HMCR + HMCR * (1-Exp [- abs (best energy) / Tn])
Tn =Tn-Î±Tn // .0005 < =Î± < 0.05
c) Applying mutation to the best harmony by randomly changing one of the angles and calculating the energy.
Parallelizing the proposed algorithm to improve the time and this will be the first attempt to parallelizing the HSA.
This thesis contains six chapters. The first chapter presents the objectives of this thesis. It presents a background of proteins and protein structure prediction methods. It also discusses the research objectives, motivations, and contributions.
The second chapter discusses the most current and related works in protein structure problems. It also discusses the most important ab initio protein structure prediction algorithms. The reasons for choosing the methodologies for the proposed system used in this research are also discussed in this chapter.
The third chapter covers the methodology discussion of the way the proposed solution is designed. Moreover, the chapter introduces the new algorithm for protein structure prediction. It also discusses the parallelized algorithm.
The fourth chapter discusses the implementation details and issues. In addition, the chapter illustrates the adaptation of HSA parameters.
The fifth chapter discusses the results obtained from the experiments in Chapter 4. I t is divided into two sections; the first section reports the results of the sequential adapted algorithm (AHSA). The second section reports the results of the Parallelized version of the adapted Algorithm (PAHSA).
Finally, the sixth chapter presents the conclusions, recommendations, and the possible future work for this study.
This chapter will explore the related works of the different methods of protein structure prediction by ab initio modeling. The components of ab initio modeling are: conformational search algorithm, energy function and a selection method. We classify the various algorithms of ab initio method based on these components. Later, the proposed solution is also described in this chapter, while the design and the methodology of the proposed solution are discussed in detail in Chapter Three.
2.2 Conformational Search Methods
A successful ab initio method for protein structure prediction depends on a powerful conformational search method to find the minimum energy for a given energy function. Molecular Dynamics (MD) and Monte Carlo (MC) are two common methods to explore protein conformational search space. For protein prediction, these two methods require an enormous amount of computational resources to explore the conformational space. A main technical difficulty of Monte Carlo simulations is that the energy landscape of protein conformational space is quite rough containing many energy barriers, which may trap the MC simulation procedures.
Different conformational search methods have been developed to overcome these problems as we will discuss in this section. We will illustrate the key ideas of conformational search methods used in various ab initio protein structure prediction methods. Until now, there is no single powerful search method that outperforms the others for all cases, while we can find some which outperforms others in some cases.
2.2.1 Molecular Dynamics (MD)
Molecular Dynamics is a powerful tool to investigate equilibrium and transport properties of many-body systems. The nuclear motion is modeled using the laws of classical mechanics such as Newton's equations (Marx D 2000). This method is most often used for the study of protein folding pathways (Duan Y 1998). One of the major issues of this method is its long simulation time, since the incremental time scale is usually in the order of femtoseconds while the fastest folding time of a small protein is in the millisecond range in nature. MD simulations are often carried out for structure refinement since the conformational changes are assumed to be small especially when a low resolution model is available. The first application of Molecular Dynamics to the protein was the work of (McCammon JA 1977) who investigated the dynamics of a folded global protein with two limitations in their model; the approximate nature of the energy function and the neglect of solvent. A research of (Edward Z. Wen 2004) presented an observed folding pathway for a 23-residue protein called Î²Î²Î±1; the results come with enhancing the sampling efficiency in molecular dynamics of ab initio folding simulations. Another remarkable approach is the work of (K. M. Liwo A 2005) who have implemented a MD simulation with the physics-based force field UNRES. Their results showed that their approach can carry out simulations of protein folding in real time, which make it possible to explore folding pathways and to derive the distribution of folding times. Later (Ding F 2008) developed an all-atom Discrete Molecular Dynamics protein model which can perform folding simulations of six small proteins with distinct native structures. Their model indicated the importance of environment-dependent hydrogen bond interactions in modeling protein folding. Recently, (Vincent A. Voelz 2010) simulated several folding trajectories of a 39-residue protein called NTL9 which has a folding time of âˆ¼1.5 millisecond. They generated ensembles of trajectories out to âˆ¼40 microsecond using distributed molecular dynamics simulations in implicit solvent on GPU processors.
2.2.2 Monte Carlo Simulations
Simulated Annealing (SA) is a stochastic optimization procedure which is widely applicable and has been found effective in several problems arising in computer aided circuit design. (Kirkpatrick S 1983) SA is as general as it can be applied on any optimization problem. The simulated annealing uses Metropolis Monte Carlo algorithm to generate a series of conformational states following the canonical Boltzmann energy distribution for a given temperature, it starts by high temperature followed by a consequent simulations which slowly decrease the temperature. Although SA is simple, its conformational search efficiency is striking in comparison to other more sophisticated methods discussed below.
Monte Carlo with minimization (MCM) (Li Z 1987)was successfully applied to the conformational search of ROSETTA's high-resolution energy function to overcome the multiple-minima problem. In MCM, one performs MC moves between local energy minima to compare it with the previously accepted local minimum to update the current conformation of each perturbed protein structure. For a given local energy minimum structure, a trial structure is generated randomly and is subject to local energy minimization. The acceptance of this trial structure is determined by the usual Metropolis algorithm by calculating the energy difference between the two structures.
Sometimes, MC simulations get stuck in a meta-stable state that may deform the distribution of sampled states, and that's when the energy landscape of the system is rough. Many simulation techniques have been developed to avoid this problem, one of the most successful techniques is the one based on the generalized ensemble approach in contrast to the usual canonical ensemble. These techniques were known by different names like multi-canonical ensemble and entropic ensemble (W. S. Lee J 2009). The basic idea in these techniques is to accelerate the transition between states separated by energy barriers by modifying the transition probability such that the final energy distribution of sampling becomes more or less flat rather than bell-shaped. A famous similar method is the replica exchange Monte Carlo Method (REM) (Kihara D 2001) where a set of many Monte Carlo simulations with different temperatures covering the entire folding transition region are carried out. To overcome energy barriers, temperatures can be exchanged from neighboring simulations to sample states from time to time. Parallel hyperbolic sampling (PHS) (K. D. Zhang Y 2002) further extended the REM method by dynamically deforming energy using an inverse hyperbolic sine function which more quickly explores the low-energy barriers in the protein.
2.2.3 Genetic Algorithm
Genetic Algorithm was suggested to use for protein folding simulations by (Unger R 1993) who proved the Schemata theorem in the context of protein structure observing that Genetic Algorithm gives more attention to favorable local structures while unfavorable local structures will be rapidly abandoned. Later (Konig R 1999) improved that method by investigating a new search strategy in combination with the simple genetic algorithm on a two-dimensional lattice model. They proposed a new search strategy called systematic crossover, which prevents the population from becoming too homogeneous. Comparing their method with (Unger R 1993), they showed that their new search strategy in combination with the simple genetic algorithm significantly increased the search effectiveness.
One of the successful genetic algorithms was proposed by (Torres S. R. 2007) with some good features like using heuristic secondary structure information to initialize the genetic algorithm with an enhanced 3D spatial representation. They used hash tables, which increase the efficiency of search and operations. In general, their model is a good predictor in comparison to the results of CASP 7, but it still needs some effort to improve the quality of the energy function and the spatial representation. It is important to highlight that the use of hash tables introduced an excellent computational technique to model amino acid spatial occupancy, because the number of collisions has been reduced to zero and the insertion, erasing and search were very efficient. Recently, (Hoque M.T 2009) presented the ab initio protein structure prediction as a conformational search problem in a low resolution model using the genetic algorithm. They showed that nondeterministic approaches such as the genetic algorithm (GA) found to be relatively promising for conformational search. However, GA often fails to provide reasonable outcome, especially for longer sequences and that is due to the nature of the complex protein structure prediction problem.
2.2.4 Mathematical Optimization
The search approach called Î± branch and bound (Î±BB) (F. C. Klepeis JL 2003) (W. Y. Klepeis JL 2005) is mathematically strict, while other methods discussed here are stochastic and heuristic methods. In this approach, the search space is cut into two halves and the lower and upper bounds (LB and UB) of the global minimum are estimated for each branched phase space. The upper bound is estimated to be the best obtained local minimum energy, while the estimate for the lower bound is obtained from the modified energy function multiplied by a quadratic term of the dissecting variables with the coefficient Î±. The LB can get the value of one energy minimum by giving a large value of Î±. By repeating the analysis of the phase space with estimating the lower and upper bounds, we can eliminate phase spaces with LB higher than the global UB.
2.3 Energy Functions
Depending on the use of statistics from the existing protein 3D structures, energy functions can be classified into two groups: physics-based energy functions and knowledge-based energy functions.
2.3.1 Physics-Based Energy Functions
In a physics-based ab initio method, interactions between atoms are based on quantum mechanical theory with only a few fundamental parameters such as the electron charge and the Planck constant; all atoms should be described by their atom types where only the number of electrons is relevant (Weiner SJ 1984). However, the computational resources required to predict protein structure from quantum mechanics are still far from what is available now. Therefore, there are no serious trials to predict structures of proteins from quantum mechanics. Some of the methods that used all-atom physics-based force fields include AMBER (Weiner SJ 1984) (Cornell WD 1995) (Duan Y 1998), CHARMM (Brooks BR 1983) (Neria E 1996) (MacKerell Jr. AD 1998), OPLS (T.-R. J. Jorgensen WL 1998) (M. D.-R. Jorgensen WL 1996) and GROMOS96 (van Gunsteren WF 1996). These potentials contain terms associated with bond lengths, angles, torsion angles, van der Waals, and electrostatics interactions. However, the major difference between them is in the selection of atom types and the interaction parameters.
For protein folding, these classical force fields were often linked with molecular dynamics simulations. The results, from the position of protein structure prediction, were not quite successful. The first landmark in such a MD-based ab initio protein folding was the work of (Duan Y 1998) who simulated the villin headpiece in explicit solvent for four months on parallel supercomputers starting from a fully unfolded extended state. Although the protein folding resolution was not high, the best of their final model was within 4.5 Å to the native state. Recently, using Folding@Home, a worldwide-distributed computer system, this small protein was folded to 1.7 Å with a total simulation time of 300 ms (Zagrovic B 2002). However, the all-atom physics-based MD simulation is still far from being used for structure prediction of long proteins (of size ~100-300 residues).
While the all-atom physics-based MD simulations were not particularly successful in structure prediction, fast search methods (such as Monte Carlo simulations and genetic algorithms) have shown to be promising in structure prediction. One example is the project of Liwo and colleagues (L. J. Liwo A 1999) (K. M. Liwo A 2005) (Oldziej S 2005) who developed a physics-based protein structure prediction method which combines the coarse grained potential of UNRES with conformational space annealing method of global optimization. In UNRES, each residue is described by two interacting off-lattice united atoms, CÎ± and the side chain centre. This effectively reduces the number of atoms, enabling us to handle large polypeptide chains (> 100 residues). The resulting prediction time for small proteins can be then reduced to 2-10 hours. The UNRES energy function6 is probably the most accurate ab initio method available, and it has been systematically applied to many CASP targets since 1998.
A multistage hierarchical algorithm ASTRO-FOLD (F. C. Klepeis JL 2003) (W. Y. Klepeis JL 2005) is another example of physics-based modeling approaches. The first stage of this model is to predict the helical segments by partitioning the overall target sequence into oligopeptides then calculate a free energy function which includes entropic, cavity formation, polarization, and ionization contributions for each oligopeptide. In the second stage, Î²_strands, Î²_sheets, and disulfide bridges are identified through a novel superstructure based mathematical framework. The RMSD of the predicted model was 4.94 Å over all 102 residues. The relative performance of this method for a number of proteins is yet to be seen in the future.
Recently, a novel approach was proposed (Taylor WR 2008) which generates many thousands of models based on an idealized representation of structure given the secondary structure assignments and the physical connection constraints of the secondary structure elements. The top scoring conformations are selected for further refinement (Jonassen I 2006). The authors successfully folded a set of five small Î±Î² proteins in the range of 100-150 residues length with the first model within 4-6 Å RMSD of the native structure. Recently, development of ROSETTA, (Bradley P 2005) (Das R 2007) used a physics-based atomic potential Monte Carlo structure refinement, after performing the low-resolution fragment assembly in the first stage (Simons KT 1997).
2.3.2 Knowledge-Based Energy Function
Knowledge-based energy functions use the statistics of the solved structures in PDB. They can be divided into two types (Skolnick 2006): The first one is sequence-independent terms that describe a generic protein such as the hydrogen bonding and the local backbone stiffness of a polypeptide chain (K. A. Zhang Y 2003). The second one is sequence-specific terms that describe local terms reflecting secondary structural preferences, including: pairwise residue contact potential (J. L. Skolnick J 1997), distance dependent atomic contact potential (Samudrala R 1998) (Lu H 2001) (Z. Y. Zhou H 2002) (Shen MY 2006), and secondary structure propensities (K. A. Zhang Y 2003) (S. J. Zhang Y, The protein structure prediction problem could be solved using the current PDB library 2005).
However, the local protein structures are difficult to reproduce in the reduced modelling eventhough the knowledge-based energy functions contain secondary structure propensity.
There are two prediction methods which used knowledge-based energy functions, and showed a remarkable success in ab initio protein structure prediction (Zhang Y, 2004a) (Simons KT 1997).
One of the remarkable methods (Bowie JU 1994), produced protein models by assembling small fragments taken from the PDB library. A successful algorithm called ROSETTA (Simons KT 1997) was developed, and showed a good performance for the free modelling targets in CASP experiments and made the fragment assembly approach popular in the field. Recently, ROSETTA was further improved by researchers (Bradley P 2005) (Das R 2007), who generated models in a reduced form with conformations specified with heavy backbone and CÎ² atoms as a first round, then in the second round they built a set of models by refining low-resolution models from the first round by an all-atom refinement procedure using an all-atom physics-based energy function, including van der Waals interactions and an orientation-dependent hydrogen-bonding potential.
After the success of the ROSETTA algorithm, many researchers developed their own energy functions using the idea of ROSETTA. For example, the energy terms of Simfold (Fujitsuka Y 2006) and Profesy (K. S. Lee J 2004); include van der Waals interactions, hydrophobic interactions, backbone dihedral angle potentials, backbone hydrogen-bonding potential, pairwise contact energies, and beta-strand pairing.
TASSER (Zhang Y, 2004a), is another successful free modeling approach which used a knowledge-based energy to construct 3D models. The used energy terms include information about predicted secondary structure, backbone hydrogen bonds, consensus predicted side chain contacts, a short-range correlation and hydrophobic interactions. In this model, the authors used both threading, to search for possible folds first, and ab initio modeling to reassemble full-length models, and build the unaligned regions.
Chunk-TASSER (S. J. Zhou H 2007), is new development of TASSER which first divides the target sequences into chunks, each chunk contains three successive secondary structure elements (helix and strand).
I-TASSER (Wu S 2007), is another development of TASSER which used iterative Monte Carlo simulations to refine TASSER cluster centroids. I-TASSER built models with correct topology (~3-5 Å) for seven cases with sequences up to 155 residues long. Recently, a comparative study on 18 ab initio prediction algorithms identified that I-TASSER is the best method in terms of the modeling accuracy and CPU cost per target (Helles 2008).
2.4 Model Selection
Another open problem in protein structure prediction is the ability to select the best appropriate models which are closer to the native structure than to the templates used in the construction. Model Quality Assessment Programs (MQAPs) were developed to perform this task (Fischer 2006). In general, model selection approaches can be classified into two types; the energy based and the free-energy based. We will focus in this section on the energy-based model selection methods, and we will discuss three methods: (1) physics-based energy function; (2) knowledge-based energy function; (3) scoring function describing the compatibility between the target sequence and model structures. There is another popular method in Model Quality Assessment Programs called consensus based method, which uses the similarity of other models taken from the predictions generated by different algorithms (Wallner B, Prediction of global and local model quality in CASP7 using Pcons and ProQ 2007). This method is also called meta-predictor approach. The essence of this method is similar to the clustering approach since both assume the most frequently occurring state as the near-native ones. This approach has been mainly used for selecting models generated by threading-servers, and it has so far been the most successful MQAP (Wallner B, Prediction of global and local model quality in CASP7 using Pcons and ProQ 2007) (Wu S 2007).
2.4.1 Physics-Based Energy Function
To develop an all-atom physics-based energy function, some researchers used existing solvation potential methods to discriminate the native structure from decoys that are generated by threading on other protein structures. For example, CHARMM (Neria E 1996) and EEF1 (Lazaridis T 1999b) were exploited and found that the energy of the native state is lower than those of decoys in most cases. Later, (Petrey D 2000) used CHARMM and a continuum treatment of the solvent, (Dominy BN 2002) and (Feig M 2002) used CHARMM plus GB solvation, (Felts AK 2002) used OPLS plus GB, (Lee MC 2004) used AMBER plus GB, and (Hsieh MJ 2004) used AMBER plus Poisson-Boltzmann solvation potential on a number of structure decoy sets (including Skolnick decoy set (Kihara D 2001) (Z. Y. Skolnick J 2003) and CASP decoys set (Moult J 2001)). All the above authors obtained similar results, i.e. the native structures have lower energy than decoys in their potentials. The claimed success of model discrimination of the physics-based potentials seems contradicted by other less successful physics-based structure prediction results. Recently, (Wroblewska L 2007) showed that the AMBER plus GB potential can only discriminate the native structure from roughly minimized TASSER decoys. Their results partially explained the inconsistency between the widely-reported decoy discrimination ability of physics-based potentials and the less successful folding results.
Knowledge-Based Energy Function
A pairwise residue-distance based potential using the statistics of known PDB structures (Sippl 1990) was developed opening the door to many researchers to propose different knowledge-based potentials such as atomic interaction potential, solvation potential, hydrogen bond potential and torsion angle potential. One of the most widely-used knowledge-based potentials is a residue-specific all-atom distance-dependent potential, (Samudrala R 1998); this potential counts the distances between 167 pseudo-atoms. Later, several atomic potentials with various reference states have been proposed (Lu H 2001) (Z. Y. Zhou H 2002) (Shen MY 2006) (Wang K 2004) (Tosatto 2005) with a claim that native structures can be distinguished from decoy structure. However, the task of selecting the near native models out of many decoys remains as a challenge for these potentials (Skolnick 2006); this is more important than native structure recognition because there are no native structures available from computer simulations in reality. In coarse-grained potentials, each residue is represented either by a single or a few atoms, for example, CÎ±-based potentials (Melo F 2002), CÎ²-based potentials (Hendlich M 1990), side chain centre-based potentials (Bryant SH 1993) (Kocher JP 1994) (Thomas PD 1996) (K. S. Zhang C 2000) (L. S. Zhang C 2004) (J. L. Skolnick J 1997), side chain and CÎ±-based potentials (Berrera M 2003). Based on the CAFASP4-MQAP experiment in 2004 (Fischer 2006), the best-performing energy functions are Victor/FRST (Tosatto 2005) and MODCHECK (Pettitt CS 2005); the first one incorporates an all-atom pairwise interaction potential, solvation potential and hydrogen bond potential while the second one includes CÎ² atom interaction potential and solvation potential. Later in CASP7-MQAP, a new model (Wallner B, Prediction of global and local model quality in CASP7 using Pcons and ProQ 2007) performed best using Pcons and ProQ based on structure consensus.
2.4.3 Sequence-Structure Compatibility Function
In the third type of MQAPs, the best models are selected based on the compatibility of target sequences to model structures instead of being selected purely based on energy functions. The earliest successful example is (Luthy R 1992) which evaluate structures using threading scores. This method was improved later by Verify3D (Eisenberg D 1997) using local threading scores in a 21-residue window. Another method (Colovos C 1993) is proposed for differentiating bettween correctly and incorrectly determined regions of protein structures based on characteristic atomic interactions, this method used a quadratic error function to describe the non-covalently bonded interactions, where near-native structures have fewer errors than other decoys. GenThreader (Jones 1999) is an efficient method which used neural networks to classify native and non-native structures. The inputs of GenThreader include sequence profile score, pairwise energy, solvation energy, number of aligned residues, length of template structure and length of target sequence. Another neural- network-based method called ProQ (Wallner B, 2003) was developed to predict the quality of a protein model that extracts structural features. The inputs of ProQ include atom and residue contacts, solvent accessible area, protein shape, similarity between predicted and model secondary structure and structural alignment score between decoys and templates. Later, a consensus MQAP called ModFold (L. McGuffin, The ModFOLD server for the quality assessment of protein structural models 2008) was developed which combined scores obtained from ProQ, (Wallner B, 2003), MODCHECK (Pettitt CS 2005) and ModSSEA (McGuffin, 2007). The author showed that ModFold outperformed the previous individual MQAPs.