Analysis Of Protein Folding Biology Essay



Protein structure prediction (PSP) has significant applications in fields such as drug design and disease prediction. Because PSP remains a major challenge in protein folding research, this paper presents a novel method for protein folding using a Structural Concealed Markov Model (SCMM). The main contribution of this work is an appropriate mapping of the protein primary structure to its 2D fold. The model also incorporates an Extended Genetic Algorithm (EGA) to fold protein sequences with long chain lengths effectively. The protein sequences produced by the SCMM are preprocessed, indexed and evaluated for accurate classification. A criterion-based analysis is then applied, using parameters such as similarity, fitness and sequence gaps, to construct the optimal protein structures. The experimental results, together with a performance analysis, indicate improved efficiency and accuracy for the proposed method.

Index Terms- Protein folding, Classification, High Dimensional Data, Fitness Correlation, SCMM, EGA.


Introduction

In general, protein folding is the process by which a protein assumes its functional configuration. Proteins are folded and held together by several forms of molecular interaction [3]. Examples include hydrophobic interactions, the formation of disulphide bonds and thermodynamic stability. The folded state of a protein is a more compact and ordered structure, whereas the unfolded state is larger and significantly less ordered. The primary structure of a protein is given by the linear sequence of its amino acids together with the positions of its disulphide bonds. Protein fold recognition is an important technique that identifies the structure on the basis of sequence similarity [15]. It was claimed in [5] that, when modeling protein structure, the energy is determined by the number of pairs of hydrophobic residues that are adjacent neighbors in the fold but not successive in the protein sequence.

The determination of a protein's tertiary structure from knowledge of its primary structure is termed protein structure prediction [11]. There are two major issues in protein structure prediction. Figure 1 shows a sample protein residue chain with energy -4. White squares indicate hydrophilic residues, whereas black squares indicate hydrophobic residues. The solid line depicts the protein sequence, while the dashed lines mark the hydrophobic-hydrophobic (HH) contacts. There are two types of HH interaction: connected H (covalent bond) and non-connected H (non-covalent bond).
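The energy rule of the 2D HP lattice model described above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name hp_energy, the coordinate representation and the four-residue example are assumptions made for illustration. Each pair of hydrophobic (H) residues that occupies neighboring lattice sites without being consecutive in the chain contributes -1 to the energy.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def hp_energy(sequence: str, coords: List[Coord]) -> int:
    """Return -1 for every H-H pair adjacent on the lattice but not consecutive in the chain."""
    if len(sequence) != len(coords):
        raise ValueError("one lattice coordinate is needed per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    position = {xy: i for i, xy in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each lattice contact is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


# Hypothetical 4-residue chain folded around a unit square: one H-H contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square))
```

A fully extended chain has no non-consecutive contacts, so its energy is 0 under this rule. Lower (more negative) energies correspond to more compact hydrophobic cores.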

Figure 1: Sample Protein Chain

While the development of new drugs and therapies is in progress, it is essential to compute and analyze the biological data obtained from genome sequencing. Sequence-structure and sequence-sequence comparison play a key role in predicting a possible cellular function for sequences. When identifying relationships between proteins, aligning the sequences improves accuracy [9].

The remainder of this paper is structured as follows: Section 2 reviews related work. Section 3 describes the proposed work. Section 4 evaluates the experimental results, and Section 5 concludes the paper and outlines future work.


Related Work

Numerous studies have addressed the protein folding problem with the aim of obtaining an appropriate protein structure. STAPL (the Standard Template Adaptive Parallel Library) was adopted for parallel protein folding [12]. That work combined roadmap analysis and potential energy calculations to achieve effective parallel folding, and used the sequential code as the basis for obtaining scalable speedups. A Guided Genetic Algorithm was presented in [7] for protein folding prediction in the two-dimensional Hydrophobic-Hydrophilic (HP) model. To preserve the boundary effectively, the shape of the H-core was defined. Two novel operators, the tilt move and the diagonal move, were included to generate the core boundary. The resulting boundary forms an HP mixed layer by establishing the probability of the sub-conformation layer.

The 2D HP model is applied to achieve this structure. The mechanism could be extended by analyzing some additional parameters. In addition, the paper [8] addressed the inverse protein folding problem on 2D and 3D lattices using the Canonical model. A shifted slice-and-dice approach was used to design a polynomial-time approximation scheme that solves the inverse protein folding problem and paves the way for analyzing protein landscapes. Lattice models have also been used for effective folding in the protein structure prediction problem. The FCC (Face-Centered Cubic) HP lattice model provides the most compact core and can map most closely to the folded protein [14]. A Hybrid Genetic Algorithm that supports square and cube lattice models was adopted for framing the 3D FCC model. Those authors developed 3D versions of crossover, mutation and conformation with optimality, whereas the present work frames protein folding in 2D. Numerous methods have been devoted to exact protein folding [13].

Figure 2: Illustration of Protein Structure Formation

Figure 2 above shows protein structure formation from the unfolded state to the 2D fold. Protein folding was performed in [1] using a Hidden Markov Model that emphasizes the relationship between the parts of an entity and the whole. In a different manner, trees were used in [2] to give a parsing perspective on protein folding, with the process carried out as a hierarchical search for locally optimal structures. Another work [4] introduced ABC (Artificial Bee Colony) optimization for 2D protein folding by applying it to the HP lattice model. The reliability of that process could be further improved by incorporating some additional efficient concepts. Further, the authors of [10] discussed the speed limit for protein folding. That work also predicted that most known ultrafast folding proteins can be engineered to fold more than ten times faster.

Another approach, based on BCO (Bacteria Chemotaxis Optimization), was developed for 2D protein folding using a lattice model. The foraging behavior of bacteria was taken into account in framing the model. The algorithm was applied effectively to proteins with short chains but became ineffective on long-chain protein sequences [6].


Proposed Work

Protein folding is an obscure and poorly understood mechanism. The proposed work combines an Extended Genetic Algorithm with a concealed Markov model to solve the prediction problem in protein folding. The ultimate goal is to represent the protein fold in a way that agrees with both the secondary structure of the protein and its amino acid sequence. The concealed Markov model combines the structural and sequential information of proteins. Furthermore, an extended Bayesian classification method is used to group protein sequences by domain. The criterion analysis of each protein sequence examines identity, fitness correlation, sequence gap and similarity to ensure optimality. Figure 3 shows the system design of the proposed mechanism.

Figure 3: System Design of Proposed Mechanism

Protein Structure Formation

Our proposed work explores the use of swarm intelligence to construct the protein structure. The principal components of several advanced algorithms are incorporated to frame the protein structure, namely BCO (Bee Colony Optimization), ACO (Ant Colony Optimization) and ABC (Artificial Bee Colony). Guided by swarm behavior, the combined procedure moves optimally towards the food source. An optimally fit solution for protein structure formation is obtained on the basis of the criteria described above.

Structural Concealed Markov Model (SCMM)

Once the domain-based classification of protein sequences is complete, the concealed Markov model is used to train and test the protein sequences. This exposes the concept of average correlation between the parts of an entity and the whole. The complex protein pattern considered in the proposed model is treated as a constituent sequence Bi, which is made up of a string of interrelated symbols B. It is assumed that each constituent sequence is assigned to a local structure. The graphical representation of the concealed Markov model is shown in Figure 4.

Figure 4: Graphical representation of Concealed Markov Model

The key benefit of the SCMM in solving protein folding is its predictive capability: the model analyzes past states and predicts the future state on the basis of that analysis. In the SCMM training phase, protein sequences are trained using the local structures acquired from a long sequence. Given the local structures, all protein sequences obtained from the database follow the same training procedure. The trained sequences are subsequently tested on specific domains. This train-and-test model for classifying protein sequences paves the way for an accurate folding process.
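The classification step can be illustrated with an ordinary discrete hidden Markov model: one model is kept per class, each model scores a sequence with the forward algorithm, and the class with the highest likelihood is chosen. This is a minimal sketch under stated assumptions and not the structural model of the paper, whose local-structure layer is omitted. The two toy models, their state names and all probabilities are hypothetical.

```python
from typing import Dict, Tuple

Model = Tuple[Dict[str, float], Dict[str, Dict[str, float]], Dict[str, Dict[str, float]]]


def forward_likelihood(obs: str, start: Dict[str, float],
                       trans: Dict[str, Dict[str, float]],
                       emit: Dict[str, Dict[str, float]]) -> float:
    """Return P(obs | model) for a discrete HMM using the forward algorithm."""
    if not obs:
        raise ValueError("observation sequence must not be empty")
    states = list(start)
    # Initialisation: start probability times emission of the first symbol.
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    # Induction: propagate through the transitions, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = {
            s: emit[s].get(symbol, 0.0) * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def classify(obs: str, models: Dict[str, Model]) -> str:
    """Pick the class whose model assigns the highest likelihood to obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical two-state models over the alphabet {H, P}.
TRANS = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}}
START = {"a": 0.5, "b": 0.5}
MODELS: Dict[str, Model] = {
    "h_rich": (START, TRANS, {"a": {"H": 0.9, "P": 0.1}, "b": {"H": 0.7, "P": 0.3}}),
    "p_rich": (START, TRANS, {"a": {"H": 0.1, "P": 0.9}, "b": {"H": 0.3, "P": 0.7}}),
}

print(classify("HHHHP", MODELS))
```

For long sequences the raw product of probabilities underflows, so a practical version would work with scaled or log-space values.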

Parameter Analysis

The crucial parameters used in our criterion-based folding analysis are identity, sequence gap, fitness correlation and similarity. The alignment of the sequences establishes the functional, structural and evolutionary relationships between the protein sequences. The criteria used in this framework are as follows:

Sequence Gap:

A sequence gap is a maximal consecutive run of spaces in a single sequence of a given alignment.


Identity:

The identity of a sequence is evaluated from the length of the acquired protein sequence or from the positions in the sequence at which the protein interactions are identical.


Similarity:

To achieve effective protein folding, the structural similarity of the obtained sequences is evaluated. Protein sequences are regarded as similar when they possess an equivalent arrangement of secondary structures and topological connections.

Fitness Correlation:

The bonding strength of the protein sequence is determined by evaluating fitness correlation.
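Two of the criteria above, sequence gap and identity, can be computed directly from a pairwise alignment. The following is a minimal sketch, assuming that gaps are written as "-" and that identity is reported as the percentage of matching residues over the aligned columns without gaps. The source does not give exact formulas, so both conventions and the function names are assumptions. Fitness correlation and structural similarity are not defined precisely enough in the text to be sketched here.

```python
def longest_gap_run(aligned: str, gap: str = "-") -> int:
    """Length of the longest consecutive run of gap characters in one aligned sequence."""
    longest = current = 0
    for ch in aligned:
        current = current + 1 if ch == gap else 0
        longest = max(longest, current)
    return longest


def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percentage of identical residues over alignment columns that contain no gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Hypothetical alignment of two short HP strings.
first = "HPH--PHH"
second = "HPHPPPHP"
print(longest_gap_run(first), round(percent_identity(first, second), 2))
```

Other identity conventions divide by the length of the shorter sequence or by the full alignment length, and the choice changes the reported value.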

EGA based protein folding

The proposed EGA algorithm for 2D protein folding is initialized with a long-pattern protein sequence (S) and its sequence length (L) in order to obtain the folded sequence in 2D. The procedure described above carries out, in sequence, the classification that groups the obtained protein sequences into domains, the training of the protein sequences and the optimization of 2D protein folding. The parameters F(S) and G(S) in the algorithm denote the fitness correlation and the sequence gap of the protein interactions, respectively. For instance, whenever the fitness exceeds the target fitness, the protein sequence proceeds to the folding process. Otherwise, the protein sequence is discarded as unsuited for folding.

The EGA-based folding algorithm presented in this paper is applied to the optimally fitted protein sequences obtained. Validating the fitness correlation is especially important, because it checks whether the fitness correlation lies between the previously established MIN and MAX values. Crossover and mutation are the two fundamental operations used in this strategy.
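To make the roles of crossover and mutation concrete, the following is a minimal sketch of a plain genetic algorithm for folding on the 2D HP lattice. It is not the EGA of this paper, whose extensions are not specified in enough detail to reproduce. The move encoding (U, D, L, R), the truncation selection, the elitism and all parameter values are assumptions made for illustration. Fitness is the number of non-consecutive H-H contacts, and self-intersecting walks are scored -1.

```python
import random
from typing import List, Optional, Tuple

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def decode(moves: str) -> Optional[List[Tuple[int, int]]]:
    """Turn a move string into lattice coordinates, or None if the walk self-intersects."""
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        nxt = (coords[-1][0] + dx, coords[-1][1] + dy)
        if nxt in coords:
            return None
        coords.append(nxt)
    return coords


def fitness(sequence: str, moves: str) -> int:
    """Number of non-consecutive H-H lattice contacts; -1 marks an invalid fold."""
    coords = decode(moves)
    if coords is None:
        return -1
    index = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        for nb in ((x + 1, y), (x, y + 1)):  # right and up: count each contact once
            j = index.get(nb)
            if j is not None and abs(i - j) > 1 and sequence[i] == sequence[j] == "H":
                contacts += 1
    return contacts


def crossover(a: str, b: str, rng: random.Random) -> str:
    """One-point crossover: head of parent a joined to tail of parent b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(moves: str, rate: float, rng: random.Random) -> str:
    """Replace each move by a random move with probability rate."""
    return "".join(rng.choice("UDLR") if rng.random() < rate else m for m in moves)


def fold(sequence: str, pop_size: int = 60, generations: int = 200,
         rate: float = 0.1, seed: int = 0) -> Tuple[str, int]:
    """Evolve move strings for a chain of at least three residues."""
    rng = random.Random(seed)
    n = len(sequence) - 1

    def score(moves: str) -> int:
        return fitness(sequence, moves)

    population = ["".join(rng.choice("UDLR") for _ in range(n)) for _ in range(pop_size)]
    population[0] = "R" * n  # a straight chain guarantees one valid individual
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(pop_size - 2)
        ]
        population = parents[:2] + children  # keep the two best unchanged
    best = max(population, key=score)
    return best, score(best)


print(fold("HPHPPHHPHH"))
```

Because the two best individuals are carried over unchanged, the best fitness never decreases between generations. A fitness threshold such as the target fitness mentioned above could be added as an early stopping condition.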


Experimental Results

The experimental evaluation of the proposed algorithm is carried out on data acquired from a transactional data set that describes the standing of firms in the market. Initially, the organized data are fed to the training model in order to evaluate the reduced dimensions. Missing responses for an item are replaced with the corresponding mean value. When the correlation matrix is analyzed, the variables are standardized, so that the variance of each standardized variable equals 1 and the total variance equals the number of variables in the analysis. When the covariance matrix is used instead, the variables retain their original metrics. In that case, particular care should be taken to use variables whose variances are similar.
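The preprocessing just described (mean substitution for missing responses, standardization, and the correlation matrix) can be sketched in plain Python. This is a minimal sketch, assuming that missing values are encoded as None and that the sample (n - 1) variance is used. The function names and the small example table are hypothetical.

```python
from math import sqrt
from typing import List, Optional


def impute_mean(column: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) by the mean of the observed entries."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]


def standardize(column: List[float]) -> List[float]:
    """Centre to mean 0 and scale to sample variance 1 (fails for constant columns)."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def correlation_matrix(columns: List[List[Optional[float]]]) -> List[List[float]]:
    """Pearson correlations between variables, each given as one column of values."""
    z = [standardize(impute_mean(c)) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


# Hypothetical data: two variables, one missing response in the first.
table = [[1.0, 2.0, None, 4.0], [2.0, 4.0, 6.0, 8.0]]
for row in correlation_matrix(table):
    print([round(v, 3) for v in row])
```

The diagonal of the result is 1 for every variable, which is the sense in which each standardized variable has unit variance.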

The efficacy of the proposed work is assessed with an implementation that uses a data set obtained from SCOP (Structural Classification of Proteins). The protein features are built from statistical knowledge of the amino acids, namely composition, transition and distribution. For accurate structural analysis and classification, the long-pattern protein sequences obtained undergo a preprocessing stage. Sample preprocessed protein sequences from the protein folding process are shown in Figure 5.
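Composition and transition features of this kind can be sketched as follows. This is a minimal sketch, assuming a three-class hydrophobicity grouping of the amino acids (polar, neutral, hydrophobic) that is commonly used with such descriptors. The source does not state which grouping was used, so the GROUPS table and the function names are assumptions, and the distribution descriptor is omitted.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed hydrophobicity grouping; the paper does not specify one.
GROUPS = {"polar": "RKEDQN", "neutral": "GASTPHY", "hydrophobic": "CLVIMFW"}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def to_classes(seq: str) -> List[str]:
    """Map each residue to its class; unknown residue codes raise KeyError."""
    return [CLASS_OF[aa] for aa in seq]


def composition(seq: str) -> Dict[str, float]:
    """Fraction of residues that fall in each class."""
    classes = to_classes(seq)
    return {name: classes.count(name) / len(classes) for name in GROUPS}


def transition(seq: str) -> Dict[Tuple[str, str], float]:
    """Fraction of adjacent residue pairs that switch between two given classes."""
    classes = to_classes(seq)
    pairs = list(zip(classes, classes[1:]))
    return {
        (a, b): sum(1 for p in pairs if set(p) == {a, b}) / len(pairs)
        for a, b in combinations(GROUPS, 2)
    }


# Hypothetical four-residue fragment (needs at least two residues for transitions).
print(composition("RKLV"))
print(transition("RKLV"))
```

Each sequence is thus mapped to a fixed-length feature vector regardless of its chain length, which is what allows sequences of different lengths to be compared by a classifier.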

Figure 5: Sample Large pattern protein sequences

For the approach stated above, the correlation matrix is evaluated, and the eigenvectors and eigenvalues are then computed. The post-evaluation of principal component analysis (PCA) further requires the Kaiser-Meyer-Olkin (KMO) measure in order to support the complete results. The basis for pattern analysis then becomes evident. The experimental analysis is illustrated with numerous sample data sets, and graphical representations are provided for a few of them.
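The eigen-analysis of a correlation matrix can be illustrated with power iteration, which finds the largest eigenvalue and its eigenvector (the first principal component). This is a minimal sketch in plain Python, assuming a symmetric matrix and a start vector that is not orthogonal to the leading eigenvector. The function name and the 2 x 2 example are hypothetical, and the KMO measure is not computed here.

```python
from math import sqrt
from typing import List, Tuple


def leading_eigenpair(matrix: List[List[float]], iterations: int = 1000,
                      tol: float = 1e-12) -> Tuple[float, List[float]]:
    """Largest eigenvalue and unit eigenvector of a symmetric matrix by power iteration."""
    n = len(matrix)
    # Unequal start entries avoid starting exactly on a non-leading eigenvector
    # such as (1, 1) for a negatively correlated pair.
    vector = [1.0 / (i + 1) for i in range(n)]
    value = 0.0
    for _ in range(iterations):
        product = [sum(a * v for a, v in zip(row, vector)) for row in matrix]
        norm = sqrt(sum(p * p for p in product))
        if norm == 0.0:
            raise ValueError("matrix maps the current vector to zero")
        vector = [p / norm for p in product]
        # Rayleigh quotient of the normalised vector estimates the eigenvalue.
        new_value = sum(v * sum(a * w for a, w in zip(row, vector))
                        for v, row in zip(vector, matrix))
        converged = abs(new_value - value) < tol
        value = new_value
        if converged:
            break
    return value, vector


# Hypothetical correlation matrix of two variables with correlation 0.6.
value, vector = leading_eigenpair([[1.0, 0.6], [0.6, 1.0]])
print(round(value, 6), [round(v, 4) for v in vector])
```

For a correlation matrix the eigenvalues sum to the number of variables, so the leading eigenvalue divided by that number gives the share of total variance explained by the first component.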

Training and Testing

In the acquired data set, 27 protein classes are considered for demonstrative analysis. Hence, the training model of the proposed system is constructed with 27 SCMMs. The protein sequences are trained with four kinds of secondary structure: 'sheet', 'helix', 'extended' and 'turn'. The number of local structures is therefore set to 4, one for each secondary structure. These four local structures, combined with the categorized domains, are used to train the protein sequences supplied as input. The data set contains 990 amino acid sequences. The technique of m-fold cross validation is used to measure the generalization power of the SCMM-based classifier. To this end, the 990 amino acid sequences are partitioned into 5 sets of 198 sequences each.
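The partition used for m-fold cross validation can be sketched as follows, with m = 5 and 990 items giving five folds of 198. This is a minimal sketch: the function names, the shuffling with a fixed seed and the use of integer indices in place of real sequences are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def make_folds(items: Sequence, m: int, seed: int = 0) -> List[list]:
    """Shuffle the items and deal them into m folds of (nearly) equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]


def train_test_splits(items: Sequence, m: int, seed: int = 0) -> Iterator[Tuple[list, list]]:
    """Yield (train, test) pairs in which each fold serves once as the test set."""
    folds = make_folds(items, m, seed)
    for k, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test


# Indices stand in for the 990 amino acid sequences.
for train, test in train_test_splits(range(990), 5):
    print(len(train), len(test))
```

In each round the classifier would be trained on 792 sequences and tested on the remaining 198, and the five test accuracies would then be averaged.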

Performance Evaluation

Once the training and testing of the protein sequences are complete, the criterion-based analysis is applied using the parameters identity, fitness correlation, sequence gap and similarity. These criteria are evaluated to obtain the optimal sequences for 2D protein folding. The values determined for these criteria are shown in Figure 6. The sequences are optimized according to those values, and performance is evaluated by comparing the results before and after optimization.

Figure 6: Results of Parameter based Analysis


Conclusion

This paper focuses on a combined solution for dimensionality reduction together with pattern classification. Dimensionality reduction is provided to improve query efficiency and accuracy, that is, to return the right pattern in the right scenario. Our newly devised approach uses the PPA theorem, which helps to reduce the dimensionality for plotting. Moreover, this paper presents a novel method for the application of protein fold recognition. In that context, the proposed work incorporates a novel approach to 2D protein folding that folds amino acid sequences of varying length. Domain-based classification results are obtained by categorizing protein interactions with the Bayesian classification method. The SCMM-based training model used in this paper effectively trains and tests the sequences. The fitness correlation of the acquired sequences is estimated through the criterion analysis. Following this evaluation, the optimized sequences construct the protein structure with the support of swarm intelligence. The 2D folding process is then implemented with the EGA framed here. The superiority of the proposed work in 2D protein folding is evaluated experimentally with closely unified sequence gaps of protein interactions.

We conclude that the proposed work can be further enhanced and extended to other research areas.