As mentioned in Chapter 1, wet lab techniques for predicting protein-protein interactions produce results with a high level of noise and low accuracy, so an alternative is needed; the approach most recently adopted by researchers is computational. The large volume of available experimental data and advances in computer science have made it possible to predict protein-protein interactions with higher accuracy.
This chapter reviews the literature on protein-protein interaction prediction in molecular biology and bioinformatics. It begins by defining the key terms used in the research and giving an overview of the literature. It first reviews the role of proteins in molecular biology, then introduces protein domain profiles and their relation to protein-protein interactions. Conventional experimental methods of protein-protein interaction prediction are described next, followed by a review of the computational approaches taken by other researchers. A machine learning algorithm, the support vector machine (SVM), is then introduced as the computational technique used in the current research. Trends in protein-protein interaction prediction are also presented. The chapter closes with a summary of the literature review.
2.2 Proteins in Molecular Biology
Proteins, also known as polypeptides, are organic compounds formed from amino acids linked together by peptide bonds. A peptide bond is a covalent bond that joins the carboxyl group of one amino acid residue to the amino group of the adjacent residue. Twenty standard amino acids are used to form proteins, and their order in a protein is defined by the sequence of the gene that encodes it. All amino acids share a common structure: a central carbon atom bonded to an amino group, a carboxyl group, a variable side chain (R group) and a hydrogen atom. Proteins range in length from a few dozen to thousands of amino acids.
As mentioned earlier, proteins are formed by linking amino acids through peptide bonds. The 20 natural amino acids, which can be divided into several categories based on their properties, are the essential material for protein formation. Each has its own name, abbreviation, symbol and properties, as shown in the table below:
Table 2.1: The amino acids, abbreviation, symbol and properties.
Cysteine (Cys, C): Neutral; forms disulfide bridges
Glycine (Gly, G): Neutral; smallest amino acid
Histidine (His, H): Positively charged; aromatic
Methionine (Met, M): Hydrophobic; start amino acid
The table above describes four general kinds of amino acid property: hydrophilic, hydrophobic, positively charged and negatively charged. Hydrophilic amino acids have water-soluble components in their R groups, and a hydrophilic protein buries its hydrophobic residues in its interior so that its hydrophilic residues are exposed to the aqueous solvent; the reverse holds for hydrophobic amino acids and proteins. Similarly, positively charged proteins have surfaces rich in positively charged side chains and tend to bind negatively charged molecules, while negatively charged proteins have surfaces rich in negatively charged amino acids and tend to bind positively charged ones. The properties of its amino acids therefore largely define the type and properties of a protein.
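As a rough illustration of the four property classes described above, the following Python sketch computes what fraction of a sequence falls into each class. The grouping of one-letter symbols is a simplifying assumption (textbooks place residues such as glycine, cysteine, tyrosine and proline differently), and the names `PROPERTY_GROUPS`, `classify_residue` and `property_composition` are hypothetical.

```python
from collections import Counter

# Simplified four-way grouping of the 20 standard amino acids.
# This assignment is an assumption made for illustration only.
PROPERTY_GROUPS = {
    "hydrophobic": set("AVLIMFWP"),
    "hydrophilic": set("STNQYCG"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def classify_residue(symbol):
    """Return the property class of a one-letter amino acid symbol."""
    for group, members in PROPERTY_GROUPS.items():
        if symbol in members:
            return group
    raise ValueError(f"unknown amino acid symbol: {symbol!r}")


def property_composition(sequence):
    """Return the fraction of residues in each property class."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(classify_residue(r) for r in sequence.upper())
    total = len(sequence)
    return {group: counts.get(group, 0) / total for group in PROPERTY_GROUPS}


# Example with a short made-up peptide: Met-Lys-Asp-Glu
print(property_composition("MKDE"))
```

For the made-up peptide above, one residue in four is hydrophobic, one in four is positively charged and two in four are negatively charged.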
Protein sequences are linear and one-dimensional, but most proteins fold into unique three-dimensional structures. The folded shape a protein adopts is known as its native conformation. This three-dimensional conformation is highly specific and depends on the interactions among the amino acids in the sequence. It can therefore be said that the amino acid sequence determines the structure of a protein, and the structure in turn determines its function. Biochemists distinguish four levels of protein structure: primary, secondary, tertiary and quaternary.
Primary structure is the linear sequence of amino acids, generally written using the amino acid abbreviations. Secondary structure forms when stretches of the polypeptide chain fold into regularly repeating patterns. Secondary structures are stabilized by hydrogen bonds between backbone atoms of the protein. The most common secondary structure is an extended spiral spring, the alpha helix, which is maintained by many hydrogen bonds formed between neighboring CO and NH groups. Alpha-helical regions are rigid and rod-like. X-ray diffraction data indicate that the alpha helix makes one complete turn for every 3.6 amino acids. The structure is very stable if all CO and NH groups participate in hydrogen bonding. Keratin is an example of a protein with an alpha-helical structure.
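The figure of 3.6 residues per turn allows a quick estimate of helix geometry. The sketch below additionally assumes the commonly quoted textbook rise of about 1.5 angstroms per residue along the helix axis, a value not given in this chapter; the function name `helix_geometry` is hypothetical.

```python
RESIDUES_PER_TURN = 3.6    # X-ray diffraction figure quoted in the text
RISE_PER_RESIDUE_A = 1.5   # assumed textbook value, in angstroms


def helix_geometry(n_residues):
    """Return (number of turns, axial length in angstroms) for an ideal alpha helix."""
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    turns = n_residues / RESIDUES_PER_TURN
    length = n_residues * RISE_PER_RESIDUE_A
    return turns, length


# A 36-residue ideal helix: about 10 turns and 54 angstroms long.
turns, length = helix_geometry(36)
print(f"{turns:.1f} turns, {length:.1f} angstroms long")
```

Under the same assumption, one full turn advances roughly 3.6 × 1.5 = 5.4 angstroms along the axis.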
Another common secondary structure is the beta-pleated sheet, the structure found in fibroin, the protein that makes silk. Its polypeptide strands lie side by side, running either in the same direction (parallel) or in opposite directions (antiparallel), and are joined by hydrogen bonds formed between CO and NH groups. The beta-pleated sheet has high tensile strength, but the arrangement of its polypeptide strands also makes it very supple.
When the polypeptide chain bends and folds extensively into a precise, compact globular shape, it forms the protein's tertiary structure, which is maintained by several types of bond and interaction. The most common are ionic bonds, hydrogen bonds, disulphide bonds and hydrophobic interactions. Hydrophobic interactions are quantitatively the most important; they occur when the protein folds so as to shield hydrophobic side groups from the aqueous surroundings while exposing hydrophilic side chains. An example of tertiary structure is myoglobin, found in muscle, whose function is to store oxygen.
Many highly complex proteins consist of more than one polypeptide chain. The separate chains are held together by hydrophobic interactions, hydrogen bonds and ionic bonds. This precise arrangement is known as quaternary structure. Haemoglobin, the red oxygen-carrying pigment found in red blood cells, is one of the best-known examples of quaternary structure. It consists of polypeptide chains of two types: two alpha chains and two beta chains. Each chain carries a haem group, to which one molecule of oxygen binds.
Its three-dimensional conformation is what enables a protein to carry out its function. Conversely, denaturation, the loss of the three-dimensional shape of a protein molecule, impairs protein function. Although the protein sequence is unaffected by denaturation, once the 3D structure is unfolded the protein can no longer perform its biological function. Agents that cause denaturation include heat or radiation, strong acids and alkalis, high concentrations of salts, heavy metals, organic solvents and detergents. Denaturation can be temporary or permanent. If it is temporary, renaturation may occur when the environment is suitable. Figure 2.1 shows the 3D conformation of protein structure from primary to quaternary structure:
2.3 Protein Domain Profile
The concept of the domain was proposed by Wetlaufer in 1973 after X-ray crystallographic studies of lysozyme, papain and immunoglobulins. Domains were first defined as stable units of protein structure that could fold autonomously. Later, domains came to be regarded as compact, semi-independent units (Richardson, 1981), each containing an identifiable hydrophobic core (Swindells, 1995).
Figure 2.1: Protein primary, secondary, tertiary and quaternary structure.
Protein domains are generally considered the basic units of protein folding, evolution and function (Vogel et al., 2005). They can exist independently of the rest of the protein chain. Usually, each domain forms a three-dimensional structure and can fold independently. Structural domains are an important component of many proteins. Domain size varies from 25 to 500 amino acids in length. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are self-stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins (Swetha, 2005-2009).
Studying the domain profile of proteins is a crucial step in both molecular biology and protein science. Structural studies using Nuclear Magnetic Resonance (NMR) and X-ray crystallography have succeeded by taking the modular nature of proteins into account (Baron et al., 1991); for example, knowledge of domain boundaries helps NMR studies, which require relatively small proteins for analysis (Pfuhl and Pastore, 1995). In addition, research has shown that using an individual domain in a database search for related sequences is often more successful than using the whole protein sequence (Sonnhammer and Durbin, 1994).
The importance of protein domains has drawn researchers to identify and search for individual domain types in protein sequences. Information highly relevant to protein-protein interactions is believed to come from domain structures. This is sensible both evolutionarily and structurally, as domains are often evolutionarily conserved sequence units and they constitute the building blocks of protein structures, largely accounting for the reciprocal interactions among the proteins to which they belong (Iqbal et al., 2008). A pair of proteins is considered an interacting pair if a domain of one protein interacts with a domain of the other. Organisms such as Saccharomyces cerevisiae have many domains embedded in their protein structures, and these biological features are now often combined with protein-interaction datasets as additional biological information to improve the accuracy of protein-protein interaction prediction.
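The domain-based rule stated above, that two proteins are taken to interact if a domain of one interacts with a domain of the other, can be sketched in Python as follows. The protein names, domain labels and the set of interacting domain pairs are invented placeholders rather than real annotations, and `proteins_interact` is a hypothetical helper.

```python
from itertools import product


def proteins_interact(domains_a, domains_b, interacting_domain_pairs):
    """Return True if any domain of one protein is listed as interacting
    with any domain of the other protein."""
    return any(
        frozenset((da, db)) in interacting_domain_pairs
        for da, db in product(domains_a, domains_b)
    )


# Placeholder data (not real annotations): protein -> set of domain labels
protein_domains = {
    "P1": {"D1", "D2"},
    "P2": {"D3"},
    "P3": {"D4"},
}
# Placeholder set of domain pairs assumed to interact
known_domain_pairs = {frozenset(("D2", "D3"))}

print(proteins_interact(protein_domains["P1"], protein_domains["P2"], known_domain_pairs))  # True
print(proteins_interact(protein_domains["P1"], protein_domains["P3"], known_domain_pairs))  # False
```

Storing each domain pair as a `frozenset` makes the lookup independent of the order in which the two proteins are given.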
2.4 Methods to Study Protein-protein Interactions Prediction
Protein-protein interaction refers to the association between pairs or groups of proteins. It occurs at all levels of the cell and regulates biological processes through the association and dissociation of protein molecules. At the molecular level, the function of a protein is determined not only by a set of molecules but also by how they interact with each other and by the results of those interactions, such as signal transduction and biological and chemical reactions.
Because protein-protein interactions are so important to understanding protein function, their prediction has received much attention from researchers in recent years. Many techniques have been developed to detect protein-protein interactions, but each has its own advantages and disadvantages, and the major concern is prediction accuracy. The techniques developed so far, especially experimental wet lab methods, have proven limited in coverage and suffer from high false positive rates and high levels of noise. This is the reason why researchers now increasingly prefer computational methods for predicting protein-protein interactions.
Several high-throughput experimental methods have been used to detect protein-protein interactions on a large scale, generating large amounts of interaction data. One common wet lab method is yeast two-hybrid screening. The yeast two-hybrid system is a molecular biology technique used to discover protein interactions by testing for physical interactions, such as binding, between two proteins. It investigates the interaction between artificial fusion proteins inside the nucleus of yeast (Bartel and Fields, 1997). One advantage of this system is that it can reveal the interaction network among proteins when first applied. Moreover, the method is scalable, which enables it to discover interactions among many proteins. However, high false positive and false negative rates have been reported for yeast two-hybrid detection. The false positive rate remains unknown, but it is believed to reach as high as 50% (Deane et al., 2002).
Tandem affinity purification (TAP) detects interactions within the correct cellular environment (e.g. in the cytosol of a mammalian cell) (Rigaut et al., 1999). It involves creating a fusion protein in which a designed piece, the TAP tag, is attached to the protein under study. This method is a significant improvement over the yeast two-hybrid approach. Its advantages are that interacting proteins can be determined quantitatively in vivo without prior knowledge of the protein complexes, and that the process is simple and offers high yield with little contamination of the target proteins. Because the TAP tag method requires two purification steps, it may not be suitable for detecting transient protein interactions. In addition, the efficiency of TAP decreases when it is applied in a different cellular environment. It is a good technique for testing permanent protein interactions and allows various degrees of control by changing the number of purification rounds.
Another common and rigorous method for detecting protein-protein interactions is co-immunoprecipitation. It works by selecting an antibody that targets a known protein that is a member of a large protein complex. Because the proteins in such a complex bind tightly to one another, targeting the known protein with the antibody increases the chance of pulling it out together with a number of unknown proteins. The unknown proteins are subsequently identified by western blotting. Co-immunoprecipitation is considered one of the most powerful experimental techniques and is used regularly to discover and analyze protein-protein interactions. However, it can consume a lot of time and manpower, because identifying the unknown proteins in a complex may require several rounds of precipitation and different types of antibody to reduce bias in the results. Moreover, co-immunoprecipitation only verifies physical interactions among proteins already suspected to interact; it does not identify unknown interactions.
Protein microarrays, also known as protein binding microarrays or protein chips (Heng et al., 2001), provide a multiplex platform for detecting protein-protein interactions. A protein microarray is a piece of glass, called an array, on which different protein molecules have been affixed at different locations in an ordered manner. One problem with protein microarrays is that protein concentrations in biological samples may differ by orders of magnitude from those of mRNA. Other notable experimental techniques used in protein-protein interaction prediction include mass spectrometry (Gavin et al., 2002) and hybrid approaches (Tong et al., 2002). The table below shows examples of wet lab methods used in protein-protein interaction prediction.
Table 2.2: Wet lab methods for protein-protein interaction predictions.