The main building blocks




1.1 Background

Proteins are the main building blocks and machinery of all living organisms. They build cellular structures and mediate biological and metabolic processes.

Thousands of different protein types are found in living organisms, where they play important roles in the activities inside the cell. In the human body, they are found in every cell and are involved in most biological activities, such as structural support, enzymatic catalysis and material transport, which are fundamental to life.

Protein formation inside the cell starts when a gene encoded in the DNA is transcribed into mRNA, which the ribosome then translates into the sequence of amino acids that composes the protein. This flow of information from DNA to RNA to protein is known as the central dogma of molecular biology.
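As a minimal sketch of the two steps just described, the Python below transcribes a DNA coding strand into mRNA and translates the mRNA into an amino acid sequence. The example DNA string and the function names are illustrative assumptions, and the codon table is deliberately partial (only a few of the 64 standard codons), so it handles this toy input rather than arbitrary genes.

```python
# Partial codon table (mRNA codon -> one-letter amino acid code).
# Only a handful of the 64 standard codons are listed to keep the sketch short.
CODON_TABLE = {
    "AUG": "M",                                      # methionine, also the start codon
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # Raises KeyError for codons missing from the partial table above.
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGTTTGGCGCAAAATAA"  # made-up toy gene, not a real sequence
mrna = transcribe(dna)      # "AUGUUUGGCGCAAAAUAA"
protein = translate(mrna)   # "MFGAK"
print(mrna, protein)
```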

Proteins are polymers of connected amino acids whose composition is encoded in genes. These amino acids are the basic building blocks of the protein. Twenty amino acid types occur in natural proteins. Each is denoted by a one-letter or three-letter code, as shown in Table 1.1. Proteins differ in the number and sequential order of their amino acids. The length of a protein molecule can vary from a few to many thousands of amino acids.

Each amino acid consists of two parts: a main chain, or backbone, and a side chain, or R group. The main chain is the same in all amino acid types. The differences lie in the side chain, which determines the chemical properties of the amino acid. The main chain contains a central carbon (Cα), which is bonded to an amino group (-NH2), a hydrogen atom (H) and a carboxylic acid group (-COOH). The side chain is attached to the central carbon and is denoted by R in Figure 1.2. There are 20 different side chain types in nature. Some are simple, consisting of a single atom, and some are complex, containing many atoms.

Amino acids are connected to each other by peptide bonds. A peptide bond is formed between two amino acids when the carboxyl group of the first amino acid reacts with the amino group of the second amino acid. A water molecule is released in this reaction, as shown in Figure 1.3.

Figure 1.3 also shows the backbone dihedral angles of the protein. The φ angle involves the backbone atoms C-N-Cα-C. The ψ angle involves the backbone atoms N-Cα-C-N. The ω angle involves the backbone atoms Cα-C-N-Cα. The side chain is also mobile and has several rotatable bonds with their own angles (not shown in the figure), whose number depends on the type of the side chain. Together, these angles determine the amino acid conformation and control the overall protein folding.
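To make the definition of a backbone dihedral angle concrete, the sketch below computes the signed dihedral angle defined by four points, which is how φ, ψ and ω are obtained from the four backbone atoms listed above. The function names are illustrative assumptions and the coordinates are made-up toy points, not atoms from a real protein.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points (0 = cis, 180 = trans)."""
    b0 = _sub(p1, p0)    # first bond vector
    b1 = _sub(p2, p1)    # central bond, the rotation axis
    b2 = _sub(p3, p2)    # last bond vector
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = _dot(b0, n2) * math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """phi: C(i-1) - N(i) - CA(i) - C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """psi: N(i) - CA(i) - C(i) - N(i+1)."""
    return dihedral(n, ca, c, n_next)


def omega(ca, c, n_next, ca_next):
    """omega: CA(i) - C(i) - N(i+1) - CA(i+1)."""
    return dihedral(ca, c, n_next, ca_next)


# Toy geometry: the central bond lies along the z axis.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
right_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, right_angle)
```

The `atan2` form is used instead of an `acos` of the normalized dot product because it keeps the sign of the angle and stays numerically stable near 0 and 180 degrees.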

Protein structure can be described at four hierarchical levels, as illustrated in Figure 1.4:

Primary structure: is the sequence of amino acids in the chain.

Secondary structure: is formed by interactions between the atoms of the main chain, which result in local structures such as the α-helix and β-sheet.

Tertiary structure: is the three-dimensional arrangement of the protein's atoms, formed as the secondary structural elements pack together due to polarity and the interactions between side chains.

Quaternary structure: is the assembly of several protein subunits (domains) held together in one protein.

1.2 Protein structure prediction problem

The mapping from the genes encoded in the DNA to the protein sequence is known as the first genetic code. Determining the protein structure from the amino acid sequence is regarded as deciphering the second genetic code (Chan and Dill, 1993, Hardin et al., 2002).

Proteins play a vital role in the biological processes of the human body. Significantly, a protein can perform its biological function only when it folds into its tertiary structure. This tertiary structure is known as the biologically active state or the native state. Moreover, many drugs are effective only when their structures closely match the structure of their target proteins (Ogura et al., 2003).

Knowledge of the protein tertiary structure reveals much valuable information. This information is essential (Greenwood and Shin, 2002) in helping scientists to better understand protein function and the many diseases caused by protein misfolding (Schlick, 2002). In the light of this understanding, scientists can design new drugs that interact with targeted proteins and modify their functions (Cheng Che et al., 1994), and design new drugs that can cure diseases (Greenwood and Shin, 2002, Schlick, 2002, Yun-Ling and Lan, 2006).

The protein structure prediction problem is simply stated: given the protein sequence, what is its tertiary structure? Solving this problem is not as simple as stating it. It is regarded as one of the great challenges in structural biology (Brock and Brunette, 2005), and it is a fundamental scientific problem and a grand challenge in computational biology, chemistry (Floudas, 2007) and Bioinformatics (Kanehisa, 1998, Meidanis, 2003, Helles, 2008). It is also one of the unresolved problems in biophysics (1R 1998). The problem can be approached using experimental methods and computational methods (Figure 1.5).

1.2.1 Experimental Protein structure determination

Experimental methods are the main trusted source of information about protein structure (Cheng Che et al., 1994); of these, Nuclear Magnetic Resonance (NMR) and X-ray crystallography are the most widely used. However, neither is free of drawbacks. They are difficult, time-consuming, laborious and expensive. Because they need special equipment and considerable human effort, determining the structure of a single protein may take from several months to years of lab work. Moreover, not all protein structures can be determined using experimental methods (Evans et al., 1995, Zhang, 2002b). For example, NMR can determine the structure only of proteins that are no longer than 100 amino acids (Jones, 2000), while X-ray crystallography requires the protein to be crystallized, and not all proteins can be crystallized (Cheng Che et al., 1994).

Because of these limitations, experimental protein structure determination remains slower than sequence determination. Genome projects identify new genes far faster than experimental methods determine protein structures (Karl-Heinz, 2003). This results in a large gap between the number of known protein sequences and the number of determined protein structures. The gap is evident from the difference between the number of protein sequences deposited in sequence databases and the number of structures deposited in structure databases. The Swiss-Prot database (UniProtKB/Swiss-Prot Release 57.1) contained 412525 entries as of 14-Apr-2009, while the Protein Data Bank (PDB) contained 57013 structures up to *********. To close this gap, faster methods of protein structure determination are needed.

1.2.2 Computational protein structure prediction

Because many protein structures are difficult to determine experimentally, scientists from many fields such as Biology, Computer Science, Mathematics, Biochemistry and Physics are working to develop theoretical and computational methods that help predict the tertiary structures of proteins. Theoretical methods are important tools that help biologists obtain protein structure information (Zhang, 2002a) because they are easy to apply and offer a cost-effective route to accurate prediction of protein structure (Beiersdorfer et al., 1997, Greenwood and Shin, 2002, Ogura et al., 2003). With computational methods, the structures of the large number of protein sequences that cannot be determined experimentally can be predicted (Baker and Sali, 2001).

Computational methods are traditionally classified into three approaches: Homology Modelling, Threading and Ab initio. These three approaches are further grouped into knowledge-based, or non-optimization, methods (Homology Modelling and Threading) and optimization, or first-principles, methods (ab initio). A more recent classification distinguishes between ab initio methods that use database information and ab initio methods that do not (Floudas, 2007).

In Homology Modelling and Fold Recognition (threading) methods, the prediction relies on similarities between the target protein sequence and the sequences of proteins whose structures have already been determined. These methods are therefore limited to proteins that belong to families with known structures. In contrast, ab initio methods do not require the protein family to have a known structure. Predicting the protein structure using ab initio methods is one of the top ten challenges in Bioinformatics (Meidanis, 2003). These methods are based on the Anfinsen thermodynamic hypothesis (Anfinsen, 1973), which states that the native tertiary structure of a protein is the conformation with the lowest free energy.

Based on the Anfinsen thermodynamic hypothesis, the protein structure prediction problem is formulated as an optimization problem (Morales et al., 2000, Garduno-Juarez et al., 2003, Ogura et al., 2003, Crivelli and Head-Gordon, 2004, Bortolussi et al., 2005, Vengadesan and Gautham, 2006, Yun-Ling and Lan, 2006), in which the goal is to search the protein conformational space for the lowest free energy conformation. This requires three components. First, a proper representation of the protein conformation is needed; depending on the degrees of freedom treated, it ranges from an all-atom representation to a simplified or reduced representation. Second, an energy function is used to calculate the energy of a conformation. Third, a conformational search algorithm explores the conformational space to find the lowest free energy conformation.
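The three components named above (a conformation representation, an energy function and a search algorithm) can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than a method taken from the cited works: the conformation is a plain list of torsion angles in degrees, `toy_energy` is an invented periodic function and not a physical force field, and the search is a basic simulated annealing loop with geometric cooling.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy over torsion angles in degrees (not a real force field).

    Each angle contributes a threefold term plus a small bias, giving a global
    minimum of 0 at -60 degrees and higher local minima near 60 and 180 degrees.
    """
    total = 0.0
    for a in angles:
        total += 1.0 + math.cos(math.radians(3.0 * a))
        total += 0.3 * (1.0 - math.cos(math.radians(a + 60.0)))
    return total


def perturb(angles, rng, max_step=30.0):
    """Return a copy with one random angle shifted and wrapped to [-180, 180)."""
    new = list(angles)
    i = rng.randrange(len(new))
    shifted = new[i] + rng.uniform(-max_step, max_step)
    new[i] = (shifted + 180.0) % 360.0 - 180.0
    return new


def simulated_annealing(n_angles, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a low-energy conformation; returns (best_angles, best_energy)."""
    rng = random.Random(seed)
    current = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    current_e = toy_energy(current)
    best, best_e = current, current_e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = perturb(current, rng)
        delta = toy_energy(candidate) - current_e
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_e = candidate, current_e + delta
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


best, best_e = simulated_annealing(n_angles=10)
print(round(best_e, 3))
```

Accepting some uphill moves at high temperature is what lets the search leave a local minimum; a real application would replace `toy_energy` with a proper energy function and would usually perturb the angles in a more informed way.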

Conformational search algorithms explore the protein conformational space with the main goal of finding the lowest free energy conformation (Zhang, 2002a). Searching the protein conformational space is a grand challenge in protein tertiary structure prediction because of the large number of possible conformations and the local minima problem. In general, if a protein has n atoms, the number of degrees of freedom is 3n-6. Accordingly, for a protein with 100 amino acids, each with 20 atoms, the number of degrees of freedom is (100*20)*3 - 6 = 5994 (Schulze-Kremer, 2000b). Alternatively, considering 5 torsional angles for each of the 100 amino acids and 5 values for each angle gives 5^500 (roughly 10^349) possible conformations.
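The size of the search space can be checked with a short calculation. The sketch below uses the figures from the paragraph above (100 amino acids, 20 atoms each, 5 torsional angles per amino acid, 5 values per angle); the function names are illustrative. With 5 angles for each of 100 amino acids there are 500 angles in total, and 5 values per angle gives 5^500 combinations.

```python
import math


def degrees_of_freedom(n_atoms: int) -> int:
    """Internal degrees of freedom of a molecule with n_atoms atoms (3n - 6)."""
    return 3 * n_atoms - 6


def count_conformations(n_residues: int, angles_per_residue: int,
                        values_per_angle: int) -> int:
    """Number of conformations when every angle takes a fixed number of values."""
    total_angles = n_residues * angles_per_residue
    return values_per_angle ** total_angles


dof = degrees_of_freedom(100 * 20)               # 5994
n_conformations = count_conformations(100, 5, 5)  # 5**500
digits = len(str(n_conformations))                # 350 decimal digits
exponent = math.log10(n_conformations)            # about 349.5

print(dof, digits, round(exponent, 1))
```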

The protein structure prediction problem, which involves searching the conformational space for the lowest free energy, is a hard combinatorial optimization problem (Lee et al., 1997, Greenwood and Shin, 2002). It has been described as NP-hard (Khimasia and Coveney, 1997, Morales et al., 2000, Garduno-Juarez et al., 2003) or even NP-complete (Seung-Yeon et al., 2003, Bortolussi et al., 2005). Exhaustive algorithms need exponential time to search the protein conformational space, which is like searching for "a needle in a haystack" (Dill, 1993). It is impractical to test all the feasible conformations to find the lowest free energy conformation. Therefore, success in predicting the protein tertiary structure depends on the efficiency of the search method in moving through different conformations without testing all conformational possibilities (Zhou and Abagyan, 2002) and without regard to the folding process (Morales et al., 2000, Day et al., 2003).

There is a need for search methods that are robust and efficient. Since the problem is a combinatorial optimization problem, a large number of optimization algorithms have been developed to search the protein conformational space. Of these, Monte Carlo Simulation (Ripoll and Thomas, 1990, Evans et al., 1995), Simulated Annealing (E. and J., 1997, Ogura et al., 2003, Tanimura et al., 2004, Yun-Ling and Lan, 2006), and Genetic algorithms (Unger and Moult, 1993, Schulze-Kremer, 1994, Gates et al., 1995, Schulze-Kremer, 1996, Beiersdorfer et al., 1997, Khimasia and Coveney, 1997, Schulze-Kremer, 2000a, Xiang, 2000, Garduno-Juarez et al., 2003, Madhusmita et al., 2008) are the most commonly used.

A new research area in computational science is now emerging, which takes its inspiration from nature and biology. It aims to propose computational algorithms that can be applied to a wide variety of complex optimization problems. These algorithms can, under suitable conditions, outperform existing conventional algorithms (Yang, 2005). Swarm Intelligence is an active research field that belongs to this family of algorithms (Merkle and Middendorf, 2005). The term Swarm Intelligence was first used in 1988 by Beni to describe a cellular robotic system (Geartner, 2004). Swarm Intelligence can be defined as the study and modelling of the collective intelligent behaviour of social insect colonies and other animal societies (Bonabeau et al., 1999) in order to inspire algorithms for solving real world and search problems (Abbass, 2001a). In principle, Swarm Intelligence based algorithms can be applied to a very broad range of problems (Geartner, 2004).

The swarm intelligence behaviour of social insects arises from their properties and characteristics, such as self-organization and division of labour (Bonabeau et al., 1999, Ajith et al., 2006). This behaviour enables social insects to solve their daily problems in the environment very efficiently. They accomplish this by distributing tasks among the colony members, by self-organization and adaptation to changes in the environment, and by robustness: the work continues even when some individuals fail (Bonabeau and Meyer, 2001).

The problems that social insects deal with in their environment are equivalent to optimization problems in the real world. Researchers have studied the behaviour of social insects for the last 25 years and built mathematical models that describe it. Scientists have used models of the behaviour of ants and other social insects to propose algorithms for solving complex optimization problems (Bonabeau and Theraulaz, 2000). These algorithms can search the solution space of a problem efficiently, in a way similar to the foraging behaviour of social insects (Bonabeau et al., 1999, Bonabeau and Meyer, 2001). Swarm intelligence algorithms based on social insect behaviour have been successful in solving combinatorial optimization problems. They have many computational advantages (Hyeong Soo 2006) and are beginning to show their power and effectiveness in many applications (Lucic 2002).

Ant behaviour has been extensively studied, and a large number of algorithms have been proposed for solving different real world problems. Honey bee behaviour, however, has not been studied as extensively, although it appears to have the same features as ant behaviour (Lemmens 2006). There is wide potential for new optimization methods derived from models of bee behaviour, analogous to ant optimization (Dornhaus et al., 1998).

In recent years, Swarm Intelligence based algorithms have been applied to Bioinformatics problems (Das et al., 2008). In the protein structure prediction problem, the idea of using the cooperative and collective behaviour of social insects to search the protein conformational space was addressed by Huber and van Gunsteren (1998). They used a swarm of molecules that cooperate with each other to search the protein conformational space using Molecular Dynamics (Huber and Van Gunsteren, 1998). Ant Colony Optimization has been used to predict protein structure with a simplified protein representation. Particle Swarm Optimization (Ayan et al., 2008) has also been used to predict the tertiary structure of proteins. Long-standing unsolved problems can be insightfully investigated using algorithms inspired by honey bee behaviour (Olague and Puente, 2006). The principles of the honey bee colony therefore offer a promising approach to difficult combinatorial optimization problems such as protein tertiary structure prediction.

Swarm Intelligence based algorithms inspired by the behaviour of the honey bee colony can be classified into three classes: algorithms inspired by the process of reproduction (marriage) in the honey bee colony, algorithms inspired by the foraging behaviour of the bee colony, and algorithms inspired by the behaviour of the queen bee. These algorithms have been applied to many applications and optimization problems.

Scope of the study

This study will focus on the protein tertiary structure prediction problem using ab initio methods, in particular the protein conformational search problem. Protein conformations are represented by the main chain and side chain torsion angles of the amino acids.

Research objectives

The goal of this research is to investigate the protein tertiary structure prediction problem using a spectrum of Swarm Intelligence algorithms. In particular, it adapts algorithms inspired by the honey bee colony to search the protein conformational space for the lowest free energy conformation. Since searching the protein conformational space is computationally expensive, the adapted algorithms are also candidates for parallelization.

The objectives of this research are:

  1. To explore the suitability of the honey bee colony inspired algorithms for the protein conformational search problem.
  2. To enhance the protein conformational search algorithm using the foraging behaviour and the reproduction process of the honey bee colony.
  3. To incorporate parallel techniques into the protein conformational search algorithms.

Layout of thesis

The body of this thesis consists of 6 chapters. The organization of the chapters is as follows:

Chapter two

This chapter will cover the computational protein structure prediction methods and concentrate on conformational search algorithms.

Chapter three

This chapter will cover the algorithms inspired by honey bee colony.

Chapter four

This chapter will present the adaptation of the honey bee algorithms to the protein conformational search and the results obtained.

Chapter five

This chapter will present the parallel protein conformational search methods and their results.

Chapter six

This chapter will present the conclusion.