Bioinformatics Information Processes In Biotic Systems Biology Essay



The term bioinformatics was coined by Paulien Hogeweg in 1978. She used it to refer to the study of information processes in biotic systems.

Bioinformatics, the application of computational techniques to analyze the information associated with biomolecules on a large scale, has now firmly established itself as a discipline in molecular biology and encompasses a wide range of subject areas, from structural biology and genomics to gene expression studies [2]. In practical terms, it is the application of computer technology to the management of biological information: computers are used to gather, store, analyze and integrate biological and genetic information, which can then be applied to gene-based drug discovery and development. The need for bioinformatics capabilities was precipitated by the explosion of publicly available genomic information resulting from the Human Genome Project. The goal of this project, determination of the sequence of the entire human genome (approximately three billion base pairs), was reached in 2003. The science of bioinformatics, which combines molecular biology with computer science, is essential to the use of genomic information in understanding human diseases and in identifying new molecular targets for drug discovery.

Bioinformatics is a subfield of the biological sciences. It is highly interdisciplinary and draws upon applied mathematics, statistics, and computer science to solve biological problems. In practice, bioinformaticians manage and analyze large biological databases using a combination of computer science, statistics, and biology. Put simply, bioinformatics is the branch of biological science that deals with methods for storing, retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequences, structures, functions, pathways and genetic interactions. It also generates new techniques for drug design, which is an important application of the field, and it drives the development of new software tools for drug design and other applications [2]. The first aim of bioinformatics is to organize data in a way that lets researchers access existing information and submit new entries as they are produced; such an organized, easily accessible store is known as a database. An example is the Protein Data Bank (PDB) [2], which holds three-dimensional macromolecular structures together with the sequences of the proteins they describe. The second aim is to develop tools and resources that aid in the analysis of the data held in these databases and of researchers' own data. An example is BLAST (Basic Local Alignment Search Tool), a sequence alignment tool that aligns a query sequence against database sequences and ranks the matches by alignment score and E-value. The E-value estimates how many alignments with an equal or better score would be expected by chance, so the lower the E-value, the more significant the alignment. The third aim is to use these tools to analyze the data and interpret the results in a biologically meaningful manner.
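To make the role of the E-value concrete, the following minimal Python sketch shows how a list of alignment hits might be filtered and ranked by E-value. It uses the standard Karlin-Altschul relation E = K * m * n * exp(-lambda * S); the constants `k` and `lam`, the cutoff of 1e-5, and the hit names and values are illustrative assumptions, not results from this essay or from any real BLAST search.

```python
import math


def expected_hits(score, query_len, db_len, k=0.13, lam=0.32):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    k and lam are illustrative placeholders (assumptions), not values
    taken from this essay or from any particular BLAST run.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def rank_hits(hits, cutoff=1e-5):
    """Keep (name, e_value) hits at or below the cutoff, lowest E-value first."""
    kept = [hit for hit in hits if hit[1] <= cutoff]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical hits: names and E-values are invented for illustration.
demo_hits = [("hit_a", 1e-3), ("hit_b", 1e-30), ("hit_c", 1e-10)]
for name, e_value in rank_hits(demo_hits):
    print(name, e_value)
```

A higher alignment score gives a smaller E-value for the same query and database size, which is why the best alignments appear first after sorting.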

Table 1. Database categories and example databases [2]

Protein sequence
Protein sequence (composite)
Protein sequence (secondary)
Macromolecular structures: Protein Data Bank (PDB), Nucleic Acids Database (NDB), HIV Protease Database
Nucleotide sequences
Genome sequences: Entrez genomes
Integrated databases: SRS (Sequence Retrieval System)


The development of software for drug design and sequence analysis is an important application of bioinformatics. Bioinformatics also draws on algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, structural biology, software engineering, data mining, image processing, and modeling and simulation. Commonly used software tools and technologies in this field include Java, XML, Perl, C, C++, Ruby, Python, R, MySQL, SQL, CUDA, MATLAB, and spreadsheet applications. One practical application is finding homologues: a driving force behind bioinformatics is the search for similarities between different biomolecules [2]. Apart from enabling systematic organization of data, identification of protein homologues has some direct practical uses. Potential drug targets can be discovered quickly by checking whether homologues of essential microbial proteins are missing in humans. On a smaller scale, structural differences between similar proteins may be exploited to design drug molecules that specifically bind to one structure but not another. Rational drug design is another application, and one of the earliest in the medical field [2]. A further application is the large-scale census: although databases can efficiently store all the information related to genomes, structures and expression datasets, it is useful to condense this information into trends and facts that users can readily understand. Broad generalizations help identify interesting subject areas for further detailed analysis and place new observations in a proper context, which makes it possible to see whether they are unusual in any way. Through these large-scale censuses one can address a number of evolutionary, biochemical and biophysical questions [2].

Bioinformatics also has medical applications. The most recent applications in the medical sciences have centered on gene expression analysis. This usually involves compiling expression data for cells affected by different diseases, e.g. cancer and arteriosclerosis, and comparing the measurements against normal expression levels. Identification of genes that are expressed differently in affected cells provides a basis for explaining the causes of illnesses and highlights potential drug targets [2].

Major research areas: sequence analysis, genome annotation, computational evolutionary biology, literature analysis, analysis of gene expression, analysis of gene regulation, analysis of protein expression, analysis of mutations in cancer, comparative genomics, modeling of biological systems, high-throughput image analysis, structural bioinformatics, molecular interactions, and docking.


Immunology is a branch of biomedical science that covers the study of all aspects of the immune system in all organisms [3]. The immune system is an adaptive defense system that has evolved in vertebrates to protect them from invading pathogenic microorganisms and cancer. It is able to generate a large number of cells and molecules that can recognize and eliminate a wide variety of foreign invaders. The immune response can be divided into two related activities: recognition and response [3]. Immune recognition is remarkable for its specificity: the immune system recognizes subtle chemical differences that distinguish one foreign pathogen from another, and it can discriminate between foreign molecules and the body's own cells and proteins. Once a pathogen has been recognized, the system recruits a variety of cells and molecules to eliminate it; this is called the effector response.

Two different mechanisms are involved in the immune response: innate immunity and adaptive immunity. Innate immunity is a nonspecific, general defense mechanism against microorganisms that are harmful to the body [3]. It comprises four types of defensive barrier: the anatomic barrier (skin, mucous secretions, tears, saliva), the physiologic barrier (temperature, pH, chemical mediators), the phagocytic barrier (macrophages, eosinophils, neutrophils) [3], and the inflammatory barrier. Adaptive immunity is specific and is capable of recognizing and selectively eliminating specific foreign microorganisms and molecules (foreign antigens). Its features are antigenic specificity, diversity, immunologic memory, and self/nonself recognition.

The main cells of the adaptive immune system are B lymphocytes and T lymphocytes.

B lymphocytes (B cells) bind directly to foreign antigens and respond to them. B cells develop and mature in the bone marrow, and each expresses a unique antigen-binding receptor on its surface. This receptor is a membrane-bound antibody molecule. Antibodies are glycoproteins.

T lymphocytes (T cells) also arise in the bone marrow. Unlike B cells, which mature within the bone marrow, T cells migrate to the thymus gland to mature. During maturation within the thymus, the T cell comes to express a unique antigen-binding molecule, called the T cell receptor, on its membrane. The membrane-bound antibodies on B cells can recognize antigen alone, but T cell receptors can recognize antigen only when it is bound to cell membrane proteins called major histocompatibility complex (MHC) molecules. MHC molecules are grouped into three classes: class I, class II, and class III. MHC class I molecules are found on the surface of nearly all nucleated cells and present bound antigen to T cytotoxic (Tc) cells. MHC class II molecules are expressed primarily on antigen-presenting cells (APCs), such as macrophages, B lymphocytes, and dendritic cells, and present processed antigen to T helper (Th) cells. Activation of both the humoral and the cell-mediated immune response requires cytokines produced by Th cells, so the activation of Th cells is itself carefully regulated. Th cells can recognize only antigen that is bound to MHC class II molecules. The antigen-presenting cell first takes the antigen into the cell by phagocytosis or endocytosis and then displays it on the cell membrane bound to MHC class II molecules. Th cells and Tc cells are distinguished from one another by the presence of either CD4 or CD8 membrane glycoproteins on their surface: CD4 is found on Th cells and CD8 on Tc cells. A Tc cell that recognizes an antigen-MHC class I complex proliferates and differentiates into an effector cell called a cytotoxic T lymphocyte (CTL). The CTL generally does not secrete many cytokines and instead exhibits cytotoxic activity. A cell that displays foreign antigen with MHC class I molecules is called an altered self-cell.

Antibodies are produced in response to foreign microorganisms or to harmful changes in gene expression that alter the body's own cells. An antibody does not bind to the whole antigen: each antibody is specific for a particular part of its antigen, and this part is called the antigenic determinant or epitope.


An epitope, also known as an antigenic determinant, is the part of an antigen that is recognized by the immune system, i.e. by antibodies, B cells, or T cells. The part of an antibody that recognizes the epitope is called the paratope. Epitopes are usually derived from non-self proteins, although host-derived sequences that can be recognized are also epitopes. Based on their structure and interaction with the paratope, epitopes are divided into two categories: conformational epitopes and linear epitopes. The interaction between a conformational epitope and the paratope depends on the 3D surface features and shape of the antigen's tertiary structure. Most epitopes are conformational. An antibody epitope, also called a B-cell epitope or antigenic determinant, is a part of an antigen recognized by either a particular B cell receptor or a particular antibody molecule of the immune system [17]. For a protein antigen, an epitope may be either a short peptide from the protein sequence, called a continuous epitope, or a patch of atoms on the protein surface, called a discontinuous epitope. Continuous epitopes can be used directly for the design of vaccines and immunodiagnostics. While continuous epitopes can be predicted using sequence-dependent methods built on available collections of immunogenic peptides [18], discontinuous epitopes, which are recognized when the immune system encounters a whole protein, pathogenic virus, or bacterium, are difficult to predict or identify from functional assays without knowledge of the three-dimensional (3D) structure of the protein [19,20].

Epitopes are mainly of two types: B cell epitopes and T cell epitopes.

1.3.1 B cell epitope

The interaction between antibodies and antigens is an important mechanism by which the immune system clears infectious organisms or infected cells from the host. An antibody binds to the antigen at a site referred to as a B cell epitope; that is, the B cell epitope is the specific site on the antigen that binds directly to the antibody produced by a B cell. Antigen-antibody interactions are key events in the immune system and trigger downstream immune processes and responses. Identifying the location of epitopes is essential in several biomedical applications, such as rational vaccine design, the development of disease diagnostics, immunotherapeutics, and epitope-based drug design.

A B cell epitope is a set of residues in an antigen that can be recognized by an antibody to activate an immune response. B cell epitopes are of two types: linear and conformational. About 10% of them are linear epitopes and the remaining 90% are conformational epitopes. Linear epitopes are also called sequential or continuous epitopes, whereas conformational epitopes are also called non-sequential or discontinuous epitopes. The two types differ in the continuity of their residues in the primary sequence: the residues of a linear epitope are contiguous in the primary sequence, while the residues of a conformational epitope are not. B cell epitopes can be used to synthesize peptides that elicit an immune response with specific cross-reacting antibodies.
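The distinction between linear and conformational epitopes can be expressed as a simple check on residue positions. The sketch below is a simplification for illustration only: it assumes an epitope is given as a list of 1-based residue positions in the primary sequence, and the function name `classify_epitope` is hypothetical. Real epitope mapping relies on 3D structural data, which this check does not use.

```python
def classify_epitope(residue_positions):
    """Label an epitope by the continuity of its residues in the primary sequence.

    Returns 'linear' when the positions form one unbroken run,
    otherwise 'conformational'.
    """
    positions = sorted(set(residue_positions))
    if not positions:
        raise ValueError("an epitope needs at least one residue position")
    # An unbroken run of n distinct positions spans exactly n sequence positions.
    contiguous = positions[-1] - positions[0] + 1 == len(positions)
    return "linear" if contiguous else "conformational"


# Hypothetical residue positions, chosen only to show both outcomes.
print(classify_epitope([12, 13, 14, 15]))
print(classify_epitope([12, 13, 40, 41, 88]))
```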

1.3.2 T cell epitope

The T cell receptor can recognize an antigen or foreign particle only when it is bound to a membrane protein of the major histocompatibility complex (MHC). The MHC is divided into three classes: class I, class II, and class III. MHC class I molecules are present on nearly all nucleated cells. An MHC class I-mediated immune response requires successful processing of the antigen and presentation of the bound antigen to T cytotoxic (Tc) cells. The antigen bound to the MHC molecule is finally recognized by the T cell receptor: antigen bound to MHC class I is recognized by Tc cells, and antigen bound to MHC class II is recognized by T helper (Th) cells. Predicting antigen processing, MHC-peptide binding, or T cell activity is a very difficult task.

The immunogenic pathway can be divided into two major phases. Phase I includes all the processes involving the antigen-presenting cell. For MHC class I, this phase includes proteasomal cleavage, peptide transport, binding of the peptide to the MHC molecule, and presentation of the complex on the cell surface. Phase II includes the recognition of this MHC-peptide complex by the T cell receptor, which leads to T cell activation.

Epitope prediction is very important in cancer research. The main causes of cancer include chemicals, altered gene expression, viruses, and UV radiation. Put simply, cancer is uncontrolled cell growth, which may result from chemicals, viruses, or altered gene expression (up-regulation or down-regulation).

In this project, TMEFF2 is analyzed using different tools in order to predict the epitopes present in its protein product.

Many different cancers exist today, and they remain a major challenge to human health.


Fig. 1. Main sites of cancer in the human body.

Cancer is unregulated cell growth: cancer cells divide and grow uncontrollably. In general, tumors are of two types, malignant and benign. Malignant tumors are formed by the uncontrolled division and growth of cells, and this type of tumor may spread to adjacent and more distant parts of the body through the lymphatic system and the bloodstream. Not all tumors are cancerous. Benign tumors do not grow uncontrollably and do not invade neighbouring tissues. More than 200 different cancers are currently known in humans. Cancer, known medically as malignant neoplasm, is a broad group of diseases.

Cancer grows out of normal cells in the body. Normal cells multiply when the body needs them, and die when the body doesn't need them. Cancer appears to occur when the growth of cells in the body is out of control and cells divide too quickly. It can also occur when cells forget how to die. There are many different kinds of cancer. Cancer can develop in almost any organ or tissue, such as the lung, colon, breast, skin, bones, or nerve tissue.

There are many causes of cancer, including benzene and other chemicals, drinking excess alcohol, environmental toxins such as certain poisonous mushrooms and a type of poison that can grow on peanut plants (aflatoxins), excessive sunlight exposure, genetic problems, obesity, radiation, and viruses. However, the cause of many cancers is unknown. The most common cause of cancer-related death is lung cancer.

The three most common cancers in men in the United States are: Prostate cancer, Lung cancer, Colon cancer. In women in the United States, the three most common cancers are: Breast cancer, Colon cancer, Lung cancer. Some cancers are more common in certain parts of the world. For example, in Japan, there are many cases of stomach cancer, but in the United States, this type of cancer is unusual. Differences in diet or environmental factors may play a role. Some other types of cancers include: Brain cancer, Cervical cancer, Hodgkin's lymphoma, Kidney cancer, Leukemia, Liver cancer, Non-Hodgkin's lymphoma, Ovarian cancer, Skin cancer, Testicular cancer, Thyroid cancer, Uterine cancer.


Cancers are classified by the type of cell that the tumor cells resemble:

1) Carcinoma: Cancers derived from epithelial cells. This group includes many of the most common cancers, particularly in the aged, and include nearly all those developing in the breast, prostate, lung, pancreas, and colon.

2) Sarcoma: Cancers arising from connective tissue (i.e. bone, cartilage, fat, nerve), each of which develops from cells originating in mesenchymal cells outside the bone marrow.

3) Lymphoma and leukemia: These two classes of cancer arise from hematopoietic (blood-forming) cells that leave the marrow and tend to mature in the lymph nodes and blood, respectively. Leukemia is the most common type of cancer in children accounting for about 30%.

4) Germ cell tumor: Cancers derived from pluripotent cells, most often presenting in the testicle or the ovary (seminoma and dysgerminoma, respectively).

5) Blastoma: Cancers derived from immature "precursor" cells or embryonic tissue. Blastomas are more common in children than in older adults.


Symptoms of cancer depend on the type and location of the cancer. For example, lung cancer can cause coughing, shortness of breath, or chest pain. Colon cancer often causes diarrhea, constipation, and blood in the stool. Some cancers may not have any symptoms at all. In certain cancers, such as pancreatic cancer, symptoms often do not start until the disease has reached an advanced stage. Symptoms that can occur with most cancers include chills, fatigue, fever, loss of appetite, malaise, night sweats, and weight loss.


Symptoms of metastasis are due to the spread of cancer to other locations in the body. They can include enlarged lymph nodes (which can be felt or sometimes seen under the skin and are typically hard), hepatomegaly (enlarged liver) or splenomegaly (enlarged spleen) which can be felt in the abdomen, pain or fracture of affected bones, and neurological symptoms.


Common tests used to diagnose cancer include biopsy of the tumor, blood tests (which look for chemicals such as tumor markers), bone marrow biopsy (for lymphoma or leukemia), chest x-ray, complete blood count (CBC), CT scan, liver function tests, and MRI scan. Most cancers are diagnosed by biopsy. Depending on the location of the tumor, the biopsy may be a simple procedure or a serious operation. Most patients with cancer have CT scans to determine the exact location and size of the tumor or tumors. A cancer diagnosis is a difficult process.


Surgery, chemotherapy, hormone therapy, and radiation are the common treatments for cancer. There is no universal cure for cancer, and cancer treatments can cause many side effects in the human body.


TMEFF2 is a gene encoding a plasma membrane protein with two follistatin-like domains and one epidermal growth factor-like domain. It has a limited normal tissue distribution and is highly overexpressed in prostate cancer. It is also known as HPP1, TENB2, and TPEF; the name HPP1 stands for hyperplastic polyposis protein 1. It is highly expressed in adult and fetal brain, spinal cord and prostate. The gene is expressed in all brain regions except the pituitary gland, with the highest levels in the amygdala and corpus callosum. It is also expressed in the pericryptal myofibroblasts and other stromal cells of normal colonic mucosa, is expressed in prostate carcinoma, and is down-regulated in colorectal cancer. At the protein level, it is present in Alzheimer disease plaques. Isoform 3 is expressed weakly in testis and at high levels in normal and cancerous prostate. Methylation of TMEFF2 may contribute to esophageal adenocarcinoma. The gene is down-regulated in tumor cell lines in response to a high level of methylation in the 5' region, and CpG island methylation correlates with TMEFF2 silencing in tumor cell lines.



2.1 TMEFF2

The transmembrane protein with epidermal growth factor and two follistatin domains (TMEFF2) is expressed in normal prostate and brain and is overexpressed in prostate cancer. Several studies suggest that TMEFF2 plays a role in suppressing the growth and invasive potential of human cancer cells, whereas other studies suggest that the shed portion of TMEFF2, which lacks the cytoplasmic region, has growth-promoting activity. Chen X and colleagues suggest that TMEFF2 has a dual mode of action: ectopic expression of wild-type full-length TMEFF2 inhibits soft agar colony formation, cellular invasion, and migration and increases cellular sensitivity to apoptosis, whereas expression of the ectodomain portion of TMEFF2 increases cell proliferation.

TMEFF2 is frequently methylated in esophageal adenocarcinoma (EAC). The major risk factor for the development of EAC is the replacement of squamous epithelium with columnar epithelium, known as Barrett's esophagus (BE). Eric Smith and colleagues studied nine genes that are frequently methylated in multiple cancer types: APC, CDKN2A, ID4, MGMT, RBP1, RUNX3, SFRP1, TIMP3, and TMEFF2. The methylation frequency for each of the nine genes in metaplastic BE (95%, 28%, 78%, 48%, 58%, 48%, 93%, 88% and 75%, respectively) was significantly higher than in squamous samples, except for CDKN2A and RBP1. The methylation frequency did not differ between BE and EAC samples, except for CDKN2A and RUNX3, which were significantly higher in EAC. Methylation density was greater in EAC than in metaplastic BE for all genes except APC, MGMT and TIMP3. There was no significant difference in methylation extent for any gene between high-grade dysplastic BE and EAC. Here, metaplasia means conversion of one cell type into another, and dysplasia means an abnormal change in the size, shape and organization of tissue.
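Because the nine genes and the nine frequencies above are given as two parallel lists, it can help to pair them explicitly. The short Python sketch below does only that, using the gene names and percentages exactly as quoted in the paragraph; the variable names are illustrative.

```python
# Gene names and methylation frequencies in metaplastic BE, in the order
# quoted in the text ("respectively").
genes = ["APC", "CDKN2A", "ID4", "MGMT", "RBP1", "RUNX3", "SFRP1", "TIMP3", "TMEFF2"]
be_methylation_pct = [95, 28, 78, 48, 58, 48, 93, 88, 75]

frequency_by_gene = dict(zip(genes, be_methylation_pct))

# List the genes from most to least frequently methylated.
for gene, pct in sorted(frequency_by_gene.items(), key=lambda item: item[1], reverse=True):
    print(f"{gene}: {pct}%")
```

Paired this way, TMEFF2 corresponds to 75%, and the two genes singled out as exceptions, CDKN2A and RBP1, correspond to 28% and 58%.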

Tsunoda S. and colleagues found that, for the genes CLDN6, FBN2, RBP1, RBP4, TFPI2 and TMEFF2, a reduction in methylation was associated with an increase in mRNA expression in demethylated cell lines. The frequency of methylation of these six genes is higher in esophageal adenocarcinoma. They also reported gene silencing by methylation of CLDN6, FBN2, RBP4, TFPI2 and TMEFF2 in esophageal squamous cell carcinoma.

Matthias P A Ebert and colleagues reported methylation of the TPEF (transmembrane protein with epidermal growth factor) gene in human colon, gastric, and bladder cancer cells. TPEF/HPP1 was frequently methylated in primary colorectal cancers. Accordingly, incubation of two cancer cell lines (LoVo and DLD-1) with the methylation inhibitor 5-aza-2′-deoxycytidine restored TPEF mRNA levels in both cell lines. In addition, the authors determined the levels of TPEF mRNA in a subset of primary colorectal cancers; using total RNA and RT-PCR analysis, they identified varying levels of TPEF mRNA in these cancer tissues. TPEF gene methylation was significantly more frequent in cancers of the colon than in cancers of the rectum. TPEF/HPP1 may be present as either a transmembrane or a soluble molecule, with an EGF module and two follistatin modules in the extracellular domain, and it also contains a potential G protein-activating motif in the cytoplasmic domain.

Raymond F. Sullivan and colleagues used a BLAST search to identify an HPP1 sequence (AF264150 [GenBank]) in connection with breast cancer; the gene had been identified as hypermethylated in hyperplastic polyps (polyps with an abnormal increase in cell number) of the colon. HPP1 was strongly expressed in normal prostate, testis, ovary, small and large intestine, esophagus, stomach, and liver, but weakly or not expressed in spleen, thymus, and peripheral blood leukocytes. HPP1 was strongly expressed in colon cancers and in an ovarian cancer, whereas weak or no HPP1 expression was seen in cancers of the breast, lung, prostate, and pancreas, and in breast cancer cell lines. The authors concluded that HPP1 is hypermethylated in breast cancer but does not appear to be hypermethylated in preinvasive breast cancer. They also noted that HPP1 is expressed mostly in normal tissue but not in normal cells.

TMEFF2 causing cancers

TMEFF2 is a gene encoding a plasma membrane protein with two follistatin-like domains and one epidermal growth factor-like domain; it has a limited normal tissue distribution and is highly overexpressed in prostate cancer. Methylation of TMEFF2 is associated with esophageal adenocarcinoma. TPEF/HPP1 is highly methylated in primary colorectal cancer. The studies reviewed above found that HPP1 is hypermethylated in hyperplastic polyps of the colon and in breast cancer, which suggests that hypermethylation of the HPP1 gene contributes to breast cancer. HPP1 is also highly expressed in colon cancer and ovarian cancer. Taken together, these findings suggest that overexpression and methylation of the TMEFF2/HPP1 gene are associated with several different human cancers. In other cancers, such as those of the breast, lung, prostate, and pancreas, and in breast cancer cell lines, it is weakly expressed or not expressed at all.



The MHC binding predictions are based on artificial neural networks (ANNs), a method that generates high-accuracy predictions of peptide binding to major histocompatibility complex molecules. The networks were trained on data from 55 MHC alleles (43 human and 12 non-human), and position-specific scoring matrices (PSSMs) are used for an additional 67 HLA alleles. Predictions are possible for peptides of length 8-11 for all 122 alleles. The artificial neural network predictions are given as actual IC50 values (Lundegaard C, 2008).
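As a sketch of what the input and output of such a predictor look like, the Python below enumerates every candidate peptide of length 8-11 from a protein sequence and labels a predicted IC50 value. The function names, the toy sequence, and the 50 nM and 500 nM cutoffs are assumptions for illustration: the cutoffs are a widely used convention rather than something stated in this essay, and no real predictor is called here.

```python
def enumerate_peptides(sequence, min_len=8, max_len=11):
    """Return (start_position, peptide) for every window of length 8-11.

    start_position is 1-based, as is usual when reporting epitopes.
    """
    peptides = []
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            peptides.append((start + 1, sequence[start:start + length]))
    return peptides


def binding_class(ic50_nm, strong_cutoff=50.0, weak_cutoff=500.0):
    """Label a predicted IC50 (in nM); lower IC50 means tighter binding.

    The cutoffs are a common convention and are assumptions here.
    """
    if ic50_nm < strong_cutoff:
        return "strong binder"
    if ic50_nm < weak_cutoff:
        return "weak binder"
    return "non-binder"


# Toy 12-residue sequence, invented for illustration.
candidates = enumerate_peptides("ACDEFGHIKLMN")
print(len(candidates), candidates[0])
```

Each enumerated peptide would then be passed to a predictor for a chosen allele, and the returned IC50 value interpreted with a rule such as `binding_class`.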


The stabilized matrix method (SMM) is distributed as a publicly available software package. Many processes in molecular biology involve the recognition of short sequences of nucleic acids or amino acids, such as the binding of immunogenic peptides to major histocompatibility complex (MHC) molecules. This method has been successfully applied to predicting peptide binding to MHC molecules, peptide transport by the transporter associated with antigen presentation (TAP) and proteasomal cleavage of protein sequences (Peters B and Sette A, 2005). Advantageous features of the package are: the output generated is easy to interpret, input and output are both quantitative, specific computational strategies to handle experimental noise are built in, the algorithm is designed to effectively handle bounded experimental data, experimental data from randomized peptide libraries and conventional peptides can easily be combined, and it is possible to incorporate pair interactions between positions of a sequence.
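Matrix-based methods of this kind score a peptide by adding up one contribution per position. The following Python sketch illustrates that general idea only; it is not the SMM package itself, and the matrix values, the offset, and the function name `score_peptide` are invented for illustration. It assumes the score is on a log10(IC50) scale, so that a lower score means stronger predicted binding, and it omits the optional pair-interaction terms mentioned above.

```python
def score_peptide(peptide, matrix, offset=0.0):
    """Sum one position-specific contribution per residue, plus an offset.

    matrix is a list with one dict per peptide position, mapping a residue
    letter to its contribution; residues missing from a dict contribute 0.
    """
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the number of matrix positions")
    return offset + sum(column.get(residue, 0.0) for residue, column in zip(peptide, matrix))


# Toy 3-position matrix with invented values (a real matrix has 8-11 positions).
toy_matrix = [
    {"A": -0.5, "L": 0.2},
    {"L": -1.0},
    {"V": -0.3, "K": 0.4},
]

log_ic50 = score_peptide("ALV", toy_matrix, offset=2.0)
print(log_ic50, 10 ** log_ic50)
```

Because every position contributes independently, the output is easy to interpret: each matrix entry shows how much a given residue at a given position raises or lowers the predicted value.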


Combinatorial peptide libraries are a useful tool for characterizing the binding specificity of class I MHC molecules. Compared to other methodologies, such as pool sequencing or measuring the affinities of individual peptides, positional scanning combinatorial libraries provide a baseline characterization of MHC molecular specificity that is cost effective, quantitative and unbiased. The combinatorial library approach is used both for describing MHC class I binding specificity and for identifying high-affinity binding peptides. These libraries were shown to be useful for identifying specific primary and secondary anchor positions, and thereby simpler motifs, analogous to those described by other approaches (Sidney J, 2008).


Prediction of which peptides can bind major histocompatibility complex (MHC) molecules is commonly used to assist in the identification of T cell epitopes. The average relative binding (ARB) matrix method directly predicts IC50 values, which allows searches involving different peptide sizes and alleles to be combined into a single global prediction. MHC binding predictions based on ARB matrices were made available on a web server (Bui HH, 2005).

2.2.5 NetMHC pan

Binding of peptides to major histocompatibility complex (MHC) molecules is the single most selective step in the recognition of pathogens by the cellular immune system. The human MHC genomic region, called human leukocyte antigen (HLA), is extremely polymorphic and comprises several thousand alleles, each encoding a distinct MHC molecule. NetMHCpan generates a quantitative prediction of the affinity of a peptide for an MHC class I molecule. It is trained on already available binding data, such as data for HLA-A and HLA-B molecules, and the NetMHCpan-2.0 method can accurately predict binding to uncharacterized HLA molecules. The method is available online (Hoof I, 2008).


Homology mapping is used to predict the structural features of an epitope and helps to analyze the epitope on the basis of structure. Structural information about epitopes, particularly the three-dimensional (3D) structures of antigens in complex with immune receptors, presents a valuable source of data for immunology. This information is available in the Protein Data Bank (PDB) and is provided in curated form by the Immune Epitope Database and Analysis Resource (IEDB). With continued growth in these data and the importance of understanding molecular-level interactions of immunological interest, there is a need for new specialized molecular visualization and analysis tools. The Epitope Viewer is a Java application used to visualize the three-dimensional structure of the antigen, the antigen-specific receptor of the immune system, and the epitope. It provides both two-dimensional and three-dimensional views of the antigen and the receptor molecule. The Epitope Viewer can be accessed from the IEDB Web site through the quick link 'Browse Records by 3D Structure' (Beaver JE, 2007).



Workflow of the project (flowchart):

UniProt database -> protein sequence retrieved -> Immune Epitope Database (IEDB) -> T cell epitope prediction and B cell epitope prediction

T cell epitope prediction: T cell epitope processing prediction (proteasome cleavage/TAP transport/MHC class I combined predictor) and T cell epitope MHC binding prediction (peptide binding to MHC class I molecules)

In this project we used the IEDB (Immune Epitope Database) for epitope prediction, UniProt for sequence retrieval, and the AFND (Allele Frequency Net Database).



UniProt stands for Universal Protein Resource, a database that provides information on proteins. It is maintained by a consortium comprising the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR), and thus combines Swiss-Prot, TrEMBL and PIR. UniProt provides four core databases: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), the UniProt Metagenomic and Environmental Sequences database (UniMES), and the UniProt Archive (UniParc). Respectively, these provide protein sequences with their functional annotation, non-redundant reference data covering sequence space at several resolutions, metagenomic and environmental sequence data, and a non-redundant archive of sequences.

The steps for sequence retrieval are as follows:

Go to UniProt site at

Enter the protein name in the search box.

Click the "Search" button.

Select one hit and retrieve the sequence.

Save the sequence in FASTA format.
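Once the sequence has been saved, it usually has to be read back in before it can be passed to a prediction tool. The following is a minimal Python sketch of a FASTA reader, given only as an illustration. The function name parse_fasta, the accession P00000, and the example sequence are made up for this sketch and do not come from the project.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the accession and the sequence are invented.
example = """>sp|P00000|EXAMPLE_HUMAN Hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

The reader joins wrapped sequence lines into one string, which is the form most sequence-based tools expect.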

3.2 IEDB (Immune Epitope DataBase)

The IEDB contains data on antibody and T cell epitopes for humans, non-human primates, rodents, and other animal species, with curated peptidic and non-peptidic epitope data relating to all infectious diseases. The database also provides tools for predicting and analyzing epitopes: T cell epitope prediction tools, B cell epitope prediction tools, and epitope analysis tools. The Immune Epitope Database and Analysis Resource (IEDB) hosts a continuously growing set of immune epitope data curated from the literature, as well as data submitted directly by experimental scientists [4]. The IEDB provides a collection of prediction tools for both MHC class I and MHC class II binding, along with several methods for analyzing T cell epitope prediction results. The goal of the IEDB is to catalog and organize information related to T- and B-cell epitopes, as well as to provide tools to predict novel epitopes and to analyze known epitopes to gain new information about them [5]. A notable feature of the IEDB is that its web site is updated continuously on the basis of user feedback [6].

Multiple applications can benefit from identifying T-cell epitopes, including the design of prophylactic vaccines [7], therapeutics [8], diagnostics [9], reagents for research [10], and de-immunization of biological drugs [11]. Designing vaccines against infectious agents is the most frequent of these applications. For a vaccine to induce a memory T-cell population capable of recognizing a pathogen, the vaccine has to contain T-cell epitopes from that pathogen. The IEDB provides a catalog of experimentally characterized T-cell epitopes, as well as data on Major Histocompatibility Complex (MHC) binding and MHC ligands [4].

The IEDB also contains tools that predict B cell epitopes from sequence and structural information. B cell epitope prediction covers methods based on surface accessibility, chain flexibility, physicochemical properties of amino acid residues, hydrophilicity scales, and propensity scales. The tools predict both continuous (linear) and discontinuous epitopes.

The tools used in this project are the B cell and T cell epitope prediction tools. The B cell epitope prediction tools comprise linear epitope prediction, DiscoTope, and ElliPro. The T cell epitope prediction tools fall into two groups based on different methods. The first is T cell epitope MHC binding prediction, which covers peptide binding to MHC class I molecules and to MHC class II molecules. The second is T cell epitope processing prediction, which includes the proteasomal cleavage/TAP transport/MHC class I combined predictor and the neural network based prediction of proteasomal cleavage sites (NetChop) and T cell epitopes (NetCTL and NetCTLpan).

3.3 B cell epitope prediction tools

These tools predict regions of proteins that are likely to be recognized as epitopes in the context of a B cell response. Three types of prediction are available: prediction of linear epitopes from protein sequence; DiscoTope, prediction of epitopes from protein structure; and ElliPro, epitope prediction based upon structural protrusion.

Taken together, the B cell tools predict continuous (linear) B cell epitopes from amino acid scales and discontinuous epitopes from protein 3D structures. Linear epitope prediction comprises the following methods. Chou & Fasman beta-turn prediction predicts turns based on the Chou-Fasman scale. Emini surface accessibility prediction is based on a surface accessibility scale; a value greater than 1 indicates an increased probability of being found on the surface. Karplus & Schulz flexibility prediction predicts the chain flexibility of a protein from a flexibility scale based on the mobility of protein segments, which is derived from temperature factors; it serves as a tool for selecting peptide antigens. Kolaskar & Tongaonkar antigenicity uses a semi-empirical method that draws on the physicochemical properties of amino acid residues and their frequencies of occurrence to predict antigenic determinants on proteins, with about 75% accuracy. Parker hydrophilicity prediction is based on a hydrophilicity scale. BepiPred linear epitope prediction predicts linear epitopes using a combination of a hidden Markov model (HMM) and a propensity scale.
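Most of the scale-based methods above share one idea: each residue is given a value from a scale, and the values are averaged over a sliding window centered on each position. The Python sketch below illustrates only that general idea. The function name window_scores, the window size, and the two-letter toy scale are assumptions made for illustration; they are not the published Parker, Emini, or Karplus & Schulz values, and this is not the IEDB implementation.

```python
def window_scores(sequence, scale, window=7):
    """Average a per-residue scale over a sliding window.

    Returns a list of (position, score) pairs, where position is the
    1-based index of the residue at the center of each window.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    scores = []
    for center in range(half, len(sequence) - half):
        segment = sequence[center - half:center + half + 1]
        mean = sum(scale[aa] for aa in segment) / window
        scores.append((center + 1, mean))
    return scores


# Toy scale for illustration only: NOT a published propensity scale.
TOY_SCALE = {"A": 0.0, "K": 1.0}

profile = window_scores("AAKKKAA", TOY_SCALE, window=3)
for position, score in profile:
    print(position, round(score, 3))
```

Residues whose window score exceeds a chosen threshold would then be reported as candidate epitope positions; the threshold itself depends on the scale used.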

DiscoTope predicts discontinuous epitopes from protein structure. The method incorporates solvent-accessible surface area calculations, as well as contact distances, into its prediction of B cell epitope potential along the length of a protein sequence. Discovery of discontinuous B-cell epitopes is a major challenge in vaccine design. Many epitope prediction methods are based on protein sequence, but their predictions are not very accurate. DiscoTope instead uses protein three-dimensional structural data. Its calculations are based on amino acid statistics, spatial information, and surface accessibility, using discontinuous epitopes determined by X-ray crystallography of antigen/antibody complexes [16]. DiscoTope is the first method to focus explicitly on discontinuous epitopes. Structure-based predictions are more accurate than sequence-based predictions, and the method can successfully predict epitope residues that have been identified by different techniques. DiscoTope detects 15.5% of residues located in discontinuous epitopes at a specificity of 95%. At the same level of specificity, the Parker hydrophilicity scale for predicting linear B-cell epitopes identifies only 11.0% of residues in discontinuous epitopes. Because its predictions are based on structure, DiscoTope can be used to map epitopes for rational vaccine design and for the development of diagnostic tools, and such mapping may lead to more efficient epitope identification.
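The percentages quoted above are a sensitivity (the fraction of true epitope residues that are detected) reported at a fixed specificity (the fraction of non-epitope residues correctly left unflagged). As a rough illustration of how the two quantities are computed per residue, the following Python sketch uses invented residue numbers; the function name and the data are assumptions and do not reproduce the DiscoTope benchmark.

```python
def sensitivity_specificity(true_epitope, predicted, all_residues):
    """Return (sensitivity, specificity) for a residue-level prediction."""
    true_epitope = set(true_epitope)
    predicted = set(predicted)
    all_residues = set(all_residues)

    tp = len(true_epitope & predicted)                 # epitope residues found
    fn = len(true_epitope - predicted)                 # epitope residues missed
    fp = len(predicted - true_epitope)                 # non-epitope residues flagged
    tn = len(all_residues - true_epitope - predicted)  # non-epitope residues left alone

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Invented example: a 20-residue antigen with 4 true epitope residues.
residues = range(1, 21)
sens, spec = sensitivity_specificity({1, 2, 3, 4}, {3, 4, 5}, residues)
print(sens, spec)
```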

ElliPro predicts epitopes based upon structural protrusion, taking solvent accessibility and flexibility into account. It predicts linear and discontinuous epitopes from a protein antigen's 3D structure and accepts either a sequence or a structure as input. If the input is a sequence, ElliPro searches the PDB for a 3D structural template using BLAST and predicts the 3D structure by homology modeling. Each predicted epitope is assigned a score, the protrusion index (PI). The PI value reflects the percentage of the protein enclosed in an ellipsoid that approximates the protein surface: residues lying outside the ellipsoid that encloses 90% of the protein have PI = 0.9. Because ElliPro is based on the geometrical properties of protein structure and does not require training, it might be applied more generally to predict different types of protein-protein interactions. The results from ElliPro suggest that further research on antibody epitopes, considering more features that discriminate epitopes from non-epitopes, may further improve predictions.

Analysis tools: these include Population Coverage, Epitope Conservancy Analysis, Epitope Cluster Analysis, and Homology Mapping. The Population Coverage tool calculates the fraction of individuals predicted to respond to a given set of epitopes with known MHC restrictions. This calculation is made on the basis of HLA genotypic frequencies, assuming no linkage disequilibrium between HLA loci. The Epitope Conservancy Analysis tool calculates the degree of conservancy of an epitope within a given protein sequence set at different degrees of sequence identity. The degree of conservation is defined as the fraction of protein sequences containing the epitope at a given identity level. The Epitope Cluster Analysis tool groups epitopes into clusters based on sequence identity. A cluster is defined as a group of sequences whose sequence similarity is greater than the specified minimum sequence identity threshold. The Homology Mapping tool maps linear and conformational epitopes to 3D structures of proteins. This is done by comparing the epitope source protein sequence with the sequences of proteins with known 3D structures in the PDB. The tool generates an alignment between the epitope source sequence and a homologous sequence from the PDB, and allows the result to be visualized in the EpitopeViewer.
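The definition of conservancy given above (the fraction of protein sequences containing the epitope at a given identity level) can be illustrated with a short Python sketch. This is a simplified reading of that definition, assuming ungapped comparison against every window of the same length as the epitope. The function names and the example sequences are hypothetical, and the actual IEDB tool may work differently.

```python
def best_identity(epitope, protein):
    """Highest fraction of matching residues between the epitope and any
    ungapped window of the same length in the protein."""
    n = len(epitope)
    if n == 0 or len(protein) < n:
        return 0.0
    best = 0.0
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        matches = sum(a == b for a, b in zip(epitope, window))
        best = max(best, matches / n)
    return best


def conservancy(epitope, proteins, min_identity=1.0):
    """Fraction of proteins that contain the epitope at >= min_identity."""
    if not proteins:
        return 0.0
    hits = sum(best_identity(epitope, p) >= min_identity for p in proteins)
    return hits / len(proteins)


# Invented sequences for illustration.
proteins = ["XXACDEXX", "XXACDFXX", "GGGGGGGG"]
print(conservancy("ACDE", proteins, min_identity=1.0))   # exact matches only
print(conservancy("ACDE", proteins, min_identity=0.75))  # one mismatch allowed
```

Lowering the identity threshold lets variants with a small number of substitutions count as conserved, which is why the tool reports results at several identity levels.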

3.4 T cell epitope prediction tools

Two types of prediction are available, based on different methods that include the artificial neural network (ANN) [12], the stabilized matrix method (SMM), and NetMHCpan. Using these methods we can identify peptides that bind multiple alleles. We can also predict the proteasomal cleavage site, TAP (transporter associated with antigen processing) transport, MHC binding, the processing score, and the total score [14]. The two types of T cell epitope prediction, which rest on different criteria, are as follows.

T cell - MHC binding prediction: These tools predict IC50 values for peptides binding to specific MHC molecules. Binding to MHC is necessary but not sufficient for recognition by T cells. T cell epitope MHC binding prediction covers peptide binding to both MHC class I and MHC class II molecules. For MHC class I, the tool takes an amino acid sequence, or a set of sequences, and determines each subsequence's ability to bind to a specific MHC class I molecule. For MHC class II, the tool employs a consensus approach to predict MHC class II epitopes based upon Sturniolo, ARB, and SMM-align.
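Evaluating "each subsequence" of a protein amounts to scanning it with overlapping windows of a fixed peptide length. The Python sketch below shows that enumeration step only. The default length of 9 is a common choice for MHC class I peptides and is an assumption here, as are the function name and the example sequence.

```python
def enumerate_peptides(sequence, length=9):
    """Return (start_position, peptide) for every overlapping window.

    start_position is 1-based, so it can be reported alongside the
    original protein sequence.
    """
    if length < 1:
        raise ValueError("length must be positive")
    return [
        (start + 1, sequence[start:start + length])
        for start in range(len(sequence) - length + 1)
    ]


# Invented 10-residue sequence: it yields two overlapping 9-mers.
peptides = enumerate_peptides("ACDEFGHIKL", length=9)
for start, peptide in peptides:
    print(start, peptide)
```

Each peptide produced this way would then be passed to a binding predictor for the MHC allele of interest.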

T Cell Epitopes - Processing Prediction: These tools predict epitope candidates based upon the processing of peptides in the cell. They include the proteasomal cleavage/TAP transport/MHC class I combined predictor and the neural network based prediction of proteasomal cleavage sites (NetChop) and T cell epitopes (NetCTL and NetCTLpan). The combined predictor brings together predictors of proteasomal processing, TAP transport, and MHC binding to produce an overall score for each peptide's intrinsic potential of being a T cell epitope. NetChop is a predictor of proteasomal processing based upon a neural network [12]. NetCTL and NetCTLpan are predictors of T cell epitopes along a protein sequence and also employ a neural network architecture. The results of these methods are analyzed on the basis of the IC50 value (half maximal inhibitory concentration). The predicted output is given in units of IC50 (nM), so a lower number indicates higher affinity. An IC50 below 50 nM is considered high affinity, below 500 nM intermediate affinity, and below 5000 nM low affinity.
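The affinity classes described above translate directly into a small lookup. The Python sketch below applies the 50, 500, and 5000 nM cut-offs from the text. The function name is hypothetical, and returning None for values of 5000 nM or more is an assumption, since the text does not name a class for that range.

```python
def classify_affinity(ic50_nm):
    """Map a predicted IC50 (in nM) to the affinity classes used in the text."""
    if ic50_nm < 0:
        raise ValueError("IC50 cannot be negative")
    if ic50_nm < 50:
        return "high"
    if ic50_nm < 500:
        return "intermediate"
    if ic50_nm < 5000:
        return "low"
    return None  # assumption: no class is given for IC50 >= 5000 nM


for value in (12.5, 320.0, 4100.0, 20000.0):
    print(value, classify_affinity(value))
```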

3.5 AFND (allele frequency net database)

AFND is a database and online repository for immune gene frequencies in worldwide populations. It is freely available to all and stores allele frequencies from different polymorphic regions of the human genome. It contains data in allele, haplotype, and genotype formats. Users can search the information already available and can also contribute their own results to one common database. Currently the database contains a large amount of information: 88,285 (HLA), 4,959 (KIR), 3,603 (cytokine), and 723 (MIC) from 4,349,169 individuals. The database was compiled to record the frequencies of alleles in different polymorphic regions across populations, for use in the field of histocompatibility and immunogenetics.


3.6 PDB (Protein Data Bank)

PDB stands for Protein Data Bank, a database that provides 3D structural data for a large number of biologically important molecules, namely proteins and nucleic acids. These structures are obtained from data submitted by researchers using X-ray crystallography and NMR spectroscopy. The Worldwide Protein Data Bank (wwPDB) maintains the PDB. PDB entries can be categorized by structure type, by gene, and by their assumed evolutionary relationships. It is a resource that provides key information for structural genomics.

3.7 Swiss PDB viewer

Swiss PDB Viewer was created by Nicolas Guex. It can analyze several proteins at the same time. Swiss PDB Viewer is used to find the active sites of a protein and to visualize the 3D structures of proteins. It can identify amino acid mutations, hydrogen bonds, angles between atoms, and distances between atoms. Swiss PDB Viewer is linked with SWISS-MODEL and also supports homology modeling of proteins. It supports the PDB file format, can read electron density maps, and supports command files for energy minimization.


T cell epitope prediction

MHC I binding prediction predicts which peptides bind to MHC class I proteins. Based on earlier predictions and performance, the tool tries to use the best possible method for a given MHC molecule. Currently, for peptide:MHC-I binding prediction, the IEDB uses the Consensus method, consisting of NetMHC, SMM, ANN, and CombLib, whenever these methods are available for the given MHC molecule; otherwise, NetMHCpan is used. This choice was motivated by the expected predictive performance of the methods, in decreasing order: Consensus > NetMHC > SMM > NetMHCpan > CombLib.

A large-scale evaluation of MHC I binding predictions found that the overall best method is the artificial neural network (ANN) [12], which makes predictions for combinations of alleles and peptide lengths [12]. The new method is shown to perform better than other methods. Mutual information calculations show that peptides that bind to HLA-A*0204 display higher order sequence correlations [12]. Neural networks are ideally suited to integrate such higher order correlations when predicting binding affinity. It is this feature, combined with the use of several neural networks derived from different and novel sequence-encoding schemes and the ability of the neural network to be trained on data consisting of continuous binding affinities, that gives the new method its improved performance. The difference in predictive performance between the neural network methods and the matrix-driven methods is most significant for peptides that bind strongly to the HLA molecule, confirming that the signal of higher order sequence correlation is most strongly present in high-binding peptides [12]. The results obtained from this method are analyzed on the basis of IC50 values; a lower IC50 value indicates higher affinity.

The stabilized matrix method (SMM) predicts peptide binding to MHC molecules, peptide transport by the transporter associated with antigen processing (TAP), and proteasomal cleavage of protein sequences [13]. TAP delivers cytosolic peptides into the endoplasmic reticulum (ER) [15]. TAP is formed by two proteins, TAP1 and TAP2, each with a hydrophobic region and an ATP-binding region [15].

Prediction of which peptides can bind major histocompatibility complex (MHC) molecules is commonly used to assist in the identification of T cell epitopes. However, because of the large number of different MHC molecules of interest, each associated with different predictive tools, tool generation and evaluation can be a very resource-intensive task [14]. The methodology commonly used to predict MHC binding affinity is the matrix, or linear coefficients, method. Average Relative Binding (ARB) matrix methods directly predict IC50 values, allowing searches involving different peptide sizes and alleles to be combined into a single global prediction. A computer program was developed to automate the generation and evaluation of ARB predictive tools [14].
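To make the matrix, or linear coefficients, idea more concrete, the Python sketch below treats the predicted log10(IC50) as an offset plus one coefficient per residue at each peptide position. This is a generic illustration and not the trained SMM or ARB matrices. The three-position toy matrix, the offset, and the function name are all invented, and real class I matrices typically cover nine positions.

```python
def predict_ic50_nm(peptide, matrix, offset):
    """Predict IC50 (nM) with a simple position-specific linear model.

    matrix is a list with one dict per peptide position, mapping an amino
    acid to its contribution to log10(IC50). Residues that are not listed
    contribute 0.0.
    """
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the number of matrix positions")
    log_ic50 = offset + sum(
        position.get(residue, 0.0) for position, residue in zip(matrix, peptide)
    )
    return 10 ** log_ic50


# Toy coefficients for illustration only: NOT taken from SMM or ARB.
TOY_MATRIX = [
    {"A": -0.5, "K": 0.25},
    {"L": -1.0},
    {"V": -0.5, "D": 0.75},
]
TOY_OFFSET = 3.0

print(predict_ic50_nm("ALV", TOY_MATRIX, TOY_OFFSET))  # favorable residues, low IC50
print(predict_ic50_nm("KGD", TOY_MATRIX, TOY_OFFSET))  # unfavorable residues, high IC50
```

Negative coefficients lower the predicted IC50 and therefore raise the predicted affinity. Because each position contributes independently, a model of this kind cannot capture the higher order sequence correlations that the neural network methods exploit.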



The sequence retrieved from UniProt was subjected to various analyses using the IEDB (Immune Epitope Database).