Insilico Protein Analysis And Design Biology Essay


Protein databases can be broadly classified into three categories: primary databases, secondary databases and structure databases. The primary structure of a protein is its amino acid sequence; sequences are stored in primary databases as linear arrays of single-letter codes denoting the constituent amino acid residues. The secondary structure of a protein comprises regions of local regularity (e.g., α-helices and β-strands) which, when sequences are aligned, are often apparent as well-conserved motifs; these are stored in secondary databases as patterns (e.g., blocks, regular expressions, profiles, fingerprints, etc.). The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold, or may give rise to independent folds or modules; complete folds, domains and modules are stored in structure databases as sets of atomic co-ordinates.

Primary sequence databases

In the early 1980s, sequence information started to become more abundant in the scientific literature. Recognizing this, various laboratories saw that they might be at an advantage if they harvested and stored these sequences in central repositories. Hence, many primary database projects began to develop at various sites around the globe. These databanks are described briefly below.

Primary nucleic acid and protein sequence databases.

Nucleic acid sequence databases

The prime DNA sequence databases are GenBank (USA), EMBL (Europe) and DDBJ (Japan), which exchange data on a daily basis to ensure comprehensive coverage at each site.

Protein sequence databases


Margaret Dayhoff developed the Protein Sequence Database at the National Biomedical Research Foundation (NBRF) in the early 1960s, as a tool for investigating evolutionary relationships among proteins. Since 1988, the Protein Sequence Database has been maintained collaboratively by PIR-International, an association of macromolecular sequence data collection centres that includes the Protein Information Resource (PIR) at the NBRF, the International Protein Information Database of Japan (JIPID) and the Martinsried Institute for Protein Sequences (MIPS).

Currently the database is split into four distinct sections, designated PIR1-PIR4, which differ in terms of data quality and level of annotation: PIR1 contains fully annotated and classified entries; PIR2 includes preliminary entries, which have not been thoroughly reviewed and may contain some redundancy; PIR3 consists of unverified entries, which have not been reviewed; and PIR4 entries fall into one of four categories: (i) conceptual translations of artefactual sequences; (ii) conceptual translations of sequences that are not transcribed or translated; (iii) protein sequences or conceptual translations that have been extensively genetically engineered; or (iv) sequences that are not genetically encoded and not produced on ribosomes. Programs are provided for data retrieval and sequence searching via the NBRF-PIR database Web page.



The Martinsried Institute for Protein Sequences (MIPS) collects and processes sequence data for the tripartite PIR-International Protein Sequence Database project (Mewes et al., 1998). The database is distributed with PATCHX, a supplement of unverified protein sequences from external sources. Access is provided through the MIPS Web server: the results of FastA similarity searches of all proteins in PIR-International and PATCHX are stored in a dynamically maintained database, giving instant access to pre-computed FastA results.


SWISS-PROT is a protein sequence database that was established in 1986 collaboratively by the EMBL and the Department of Medical Biochemistry at the University of Geneva; in 1994 the collaboration moved to EMBL's UK outstation, the EBI (Bairoch and Apweiler, 1998). In April 1998, a further change saw a move to the Swiss Institute of Bioinformatics (SIB), and the database is now maintained collaboratively by the SIB and the EBI/EMBL. The database endeavours to provide high-level annotations, including descriptions of the function of a protein, the structure of its domains, its variants, its post-translational modifications, and so on. SWISS-PROT aims to be minimally redundant, and is interlinked to many other resources. In 1996, a computer-annotated supplement to SWISS-PROT was created, termed TrEMBL, which is described in more detail below. First, we take a closer look at the structure of SWISS-PROT entries.

The structure of SWISS-PROT entries

The structure of the database, and the quality of its annotations, set SWISS-PROT apart from other protein sequence resources and have made it the database of preference for many research purposes. By mid-1998, the database contained ~70 000 entries from more than 5000 different species, the bulk of these coming from just a small number of model organisms (e.g., Homo sapiens, Saccharomyces cerevisiae, Escherichia coli, Mus musculus and Rattus norvegicus). An example entry is shown in Figure 3.1. Each line is flagged with a two-letter code, which helps to present the information in a structured way. Entries begin with an identification (ID) line and end with a // terminator. Here, the ID line informs us that the entry name is OPSD_SHEEP, a protein sequence of 348 amino acids. ID codes in SWISS-PROT have been designed to be informative and people-friendly; they take the form PROTEIN_SOURCE, where the PROTEIN part of the code is an acronym that denotes the type of protein, and SOURCE indicates the organism name. The protein in this example is clearly derived from sheep and, with the eye of experience, we can deduce that it is a rhodopsin. Unfortunately, ID codes can sometimes change, so an additional identifier, an accession number, is also provided, which ought to remain static between database releases. The accession number is given on the AC line, here P02700, which, although relatively uninformative to the human user, is nevertheless computer readable. If several numbers appear on the same AC line, the first, or primary, accession number is the most current. Next, the DT lines provide information about the date of entry of the sequence to the database, and details of when it was last modified. The description (DE) line, or lines, then informs us of the name, or names, by which the protein is known - here simply rhodopsin. The following lines give the gene name (GN), organism species (OS) and organism classification (OC) within the biological kingdoms.
The next section of the database provides a list of supporting references; these can be from the literature, unpublished information submitted directly from sequencing projects, data from structural or mutagenesis studies, and so on. The database is thus an important repository of information that is difficult, or impossible, to find elsewhere.

Following the references are found comment (CC) lines. These are divided into themes, which tell us about the FUNCTION of the protein, its post-translational modifications (PTM), its TISSUE SPECIFICITY, SUBCELLULAR LOCATION, and so on. Where such information is available, the CC lines also indicate any known SIMILARITY or affiliation to particular protein families. In this example, we learn that rhodopsin is an integral membrane 'visual' protein found in rod cells; it belongs to the opsin family and to the type 1 G-protein-coupled receptor (GPCR) superfamily. Database cross-reference (DR) lines follow the comment field. These provide links to other biomolecular databases, including primary sources, secondary databases, specialist databases, etc. For ovine rhodopsin, we find links to the primary PIR source, to the GPCR specialist database, to the PROSITE secondary database and to the ProDom domain database. Directly after the DR lines is found a list of relevant keywords (KW), and then a number of FT lines, which form what is known as a Feature Table. The Feature Table highlights regions of interest in the sequence, including local secondary structure (such as transmembrane domains, as seen in the figure), ligand binding sites, post-translational modifications, and so on. Each line includes a key (e.g., TRANSMEM), the location in the sequence of the feature (e.g., 37-61), and a comment, which might, for example, indicate the level of confidence of a particular annotation (e.g., POTENTIAL). For our rhodopsin example, the transmembrane domain assignments result from the application of prediction software and, therefore, in the absence of supporting experimental 3D structural data, can only be flagged as potential.
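The FT lines described above lend themselves to straightforward parsing. The following Python sketch (the helper name and the sample feature lines are ours, modelled on the rhodopsin example rather than taken from a real entry) extracts each feature's key, location and comment:

```python
# Sketch: parsing SWISS-PROT-style Feature Table (FT) lines into
# (key, start, end, comment) tuples. Sample lines are illustrative.

def parse_ft_lines(lines):
    """Extract features from the FT lines of a flat-file entry."""
    features = []
    for line in lines:
        if not line.startswith("FT "):
            continue
        parts = line[2:].split()
        key, start, end = parts[0], int(parts[1]), int(parts[2])
        comment = " ".join(parts[3:])  # e.g. a confidence flag such as POTENTIAL
        features.append((key, start, end, comment))
    return features

sample = [
    "FT   TRANSMEM     37     61       POTENTIAL.",
    "FT   DISULFID    110    187",
]
for feat in parse_ft_lines(sample):
    print(feat)
```

A downstream tool could then, for instance, count only the TRANSMEM features to estimate how many membrane-spanning regions a prediction method assigned.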

The final section of the database entry contains the sequence itself, on the SQ lines. For efficiency of storage, the single-letter amino acid code is used, each line containing 60 residues. Sequence data in SWISS-PROT correspond to the precursor form of the protein, before post-translational processing; hence information concerning the size or molecular weight will not necessarily correspond to values for the mature protein. The extent of mature proteins or peptides may be deduced by reference to the Feature Table, which indicates the region of the sequence corresponding to the mature product.

The structure of SWISS-PROT makes computational access to the different information fields both straightforward and efficient - for example, query software need not search the full flat-file, but can be directed to those lines that are specific to the nature of the query. For this reason, coupled with the quality of its biological annotations, SWISS-PROT has become probably the most widely used protein sequence database in the world.
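The line-code structure is what makes this field-directed access possible. As a minimal illustration (the miniature entry below is invented for the example, not a real database record), an entry can be indexed by its two-letter codes so a query touches only the lines it needs:

```python
# Sketch: grouping a SWISS-PROT-style flat-file entry by its two-letter
# line codes, so query software can go straight to specific fields
# instead of scanning the whole entry. The entry below is illustrative.

from collections import defaultdict

def index_entry(text):
    fields = defaultdict(list)
    for line in text.strip().splitlines():
        if line == "//":               # entry terminator
            break
        code, _, rest = line.partition("   ")
        fields[code].append(rest.strip())
    return fields

entry = """\
ID   OPSD_SHEEP     STANDARD;   PRT;   348 AA.
AC   P02700;
DE   RHODOPSIN.
OS   OVIS ARIES (SHEEP).
//"""

fields = index_entry(entry)
print(fields["AC"])   # direct access to the accession line only
print(fields["DE"])
```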



TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated supplement to SWISS-PROT (Bairoch and Apweiler, 1998). The database benefits from the SWISS-PROT format, and contains translations of all coding sequences (CDS) in EMBL. TrEMBL has two main sections, designated SP-TrEMBL and REM-TrEMBL: SP-TrEMBL (SWISS-PROT TrEMBL) contains entries that will eventually be incorporated into SWISS-PROT, but that have not yet been manually annotated; REM-TrEMBL contains sequences that are not destined for inclusion in SWISS-PROT - these include immunoglobulins and T-cell receptors, fragments of fewer than eight amino acids, synthetic sequences, patented sequences, and CDS translations that do not encode real proteins. TrEMBL was designed to address the need for a well-structured, SWISS-PROT-like resource allowing rapid access to sequence data from the genome projects, without compromising the quality of SWISS-PROT itself by incorporating sequences with insufficient analysis and annotation.



The NRL-3D database is produced by PIR from sequences extracted from the Brookhaven Protein Databank (PDB). The titles and biological sources of the entries conform to the nomenclature standards used in the PIR. Bibliographic references and MEDLINE cross-references are included, together with secondary structure, active site, binding site and modified site annotations, and details of experimental method, resolution, R-factor, etc. Keywords are also provided. NRL-3D is a valuable resource, as it makes the sequence information in the PDB available both for keyword interrogation and for similarity searches. The database may be searched using the ATLAS retrieval system, a multi-database information retrieval program specifically designed to access macromolecular sequence databases.


Composite protein sequence databases

One solution to the problems posed by the proliferation of primary databases is to compile a composite, i.e. a database that amalgamates a variety of primary sources. Composite databases make sequence searching much more efficient, because they remove the need to query multiple resources. The interrogation process is streamlined further if the composite has been designed to be non-redundant. Different strategies are employed to create composite resources, and the final product depends on the chosen data sources and the criteria used to merge them: for example, a composite will be merely non-identical if it eliminates only exact sequence copies during the merging process; but if both identical and highly similar sequences are discarded (e.g., entries that differ by only one residue, such as an initiating methionine), then the resulting database will be more truly non-redundant. The choice of different sources, and the varying application of redundancy criteria, have led to the emergence of several composites, each with its own format. The prime composite databases are outlined below.
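The distinction between the two merging criteria can be sketched as follows (a toy illustration with invented sequences; real pipelines operate on full database entries, and "highly similar" is judged by more than a leading methionine):

```python
# Sketch of the two merging criteria: a "non-identical" merge removes exact
# duplicates only, while a stricter "non-redundant" merge also discards
# near-copies, here simplified to sequences differing only by a leading Met.

def merge_non_identical(sources):
    seen, merged = set(), []
    for seq in sources:
        if seq not in seen:
            seen.add(seq)
            merged.append(seq)
    return merged

def merge_non_redundant(sources):
    merged = []
    for seq in merge_non_identical(sources):
        # drop a sequence that duplicates a kept one up to a leading methionine
        if any(seq == "M" + kept or kept == "M" + seq for kept in merged):
            continue
        merged.append(seq)
    return merged

seqs = ["MKVLA", "KVLA", "MKVLA", "GDTWC"]
print(merge_non_identical(seqs))   # keeps KVLA: only the exact copy is dropped
print(merge_non_redundant(seqs))   # also drops KVLA, the near-copy of MKVLA
```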


NRDB (Non-Redundant DataBase) is built locally at the NCBI. The database is a composite of SWISS-PROT, SPupdate (the weekly updates of SWISS-PROT), PDB sequences, GenPept (derived from automatic GenBank CDS translations), GenPeptupdate (the daily updates of GenPept) and PIR. The database is thus comprehensive and up to date. Strictly speaking, however, it is not non-redundant but non-identical, i.e., only identical sequence copies are removed from the resource. This rather simplistic approach leads to a number of problems: multiple copies of the same protein are retained as a result of polymorphisms and/or minor sequencing errors; incorrect sequences that have been amended in SWISS-PROT are reintroduced when retranslated from the DNA; and numerous sequence fragments are retained alongside full-length entries of the same protein. As a result, the contents of NRDB are both error-prone and, in spite of its name, redundant. NRDB is the default database of the NCBI BLAST service.



OWL is a non-redundant protein sequence database built at the University of Leeds in collaboration with the Daresbury Laboratory in Warrington. The database is a composite of four major primary sources: SWISS-PROT, PIR1-4, GenBank (CDS translations) and NRL-3D. The sources are assigned priorities according to their level of annotation and sequence validation; SWISS-PROT has the highest priority, so all others are compared against it during the amalgamation procedure. This process eliminates both identical copies of sequences and those containing single amino acid differences, leading to a compact (and efficient) resource for sequence comparisons. Nevertheless, the database suffers from many of the same problems as NRDB, in that some sequencing errors and retranslations of incorrect GenBank sequences are retained; and because OWL is released only every 6-8 weeks, it suffers the further drawback of not being up to date. BLAST services for OWL are available from the UK EMBnet National Node, SEQNET, and from the UCL Specialist Node.



MIPSX is a merged database produced at the Max-Planck-Institut in Martinsried (Mewes et al., 1998). The database contains information from the following resources: PIR1-4; MIPS preliminary entries, MIPSOwn; MIPS/PIR preliminary entries, PIRMOD; MIPS preliminary translations, MIPSTrn; MIPS yeast entries, MIPSH; NRL-3D; SWISS-PROT; EMTrans, an automatic translation of EMBL; GBTrans, translated GenBank entries; Kabat; and PSeqIP. The sources are assigned priorities in the order listed, and sequences that are identical either within or between them are removed, leaving only unique copies. In addition, all subsequences (i.e., sequences completely contained within others) are removed.


At the EBI, the combination of SWISS-PROT and TrEMBL provides a resource that is both comprehensive and 'minimally' redundant. This database has the advantage of containing fewer errors than do those mentioned above, yet it is still not truly non-redundant (in mid-1997, it was estimated that around 30% of the combined total of SWISS-PROT and TrEMBL was non-unique). To reduce error rates and redundancy levels further will require increasing levels of human intervention and/or the future development of expert database management systems. SWISS-PROT and TrEMBL can be searched by means of the SRS sequence retrieval system on the EBI Web server.

Secondary databases

In addition to the numerous primary and composite resources, there are many secondary (or pattern) databases, so-called because they contain the fruits of analyses of the sequences in the primary sources. Because there are several different primary databases, and a variety of ways of analysing protein sequences, the information housed in each of the secondary resources is different - and their formats reflect these disparities. Designing software tools that can search the different types of data, interpret the range of outputs, and assess the biological significance of the results is not a trivial task. Although this appears to present the usual confusing picture, where nothing is consistent and there are no standards, SWISS-PROT has emerged as the most popular primary source, and many secondary databases now use it as their basis.

Why create secondary databases?

The type of information stored in each of the secondary databases is different. Yet these resources have arisen from a common principle: namely, that homologous sequences may be gathered together in multiple alignments, within which are conserved regions that show little or no variation between the constituent sequences. These conserved regions, or motifs, usually reflect some vital biological role (i.e., are somehow crucial to the structure or function of the protein). Motifs have been exploited in different ways to build diagnostic patterns for particular protein families. The idea is that an unknown query sequence may be searched against a library of such patterns to determine whether or not it contains any of the predefined characteristics, and hence whether or not it can be assigned to a known family. If the structure and function of the family are known, searches of pattern databases thus offer a fast track to the inference of biological function. Because pattern databases are derived from multiple sequence information, searches of them are often better able to identify distant relationships than are corresponding searches of the primary databases. However, none of the pattern databases is yet complete; they should therefore only be used to augment primary database searches, rather than to replace them.


The first secondary database to have been developed was PROSITE, which is now maintained collaboratively at the Swiss Institute of Bioinformatics. The rationale behind its development was that protein families could be simply and effectively characterised by the single most conserved motif observable in a multiple alignment of known homologues, such motifs usually encoding key biological functions (e.g., enzyme active sites, ligand or metal binding sites, etc.). Searching such a database should, in principle, help to determine to which family of proteins a new sequence might belong, or which domain(s) or functional site(s) it might contain. Within PROSITE, motifs are encoded as regular expressions, often simply referred to as patterns. The process used to derive patterns involves the construction of a multiple alignment and manual inspection to identify conserved regions. Sequence information within individual motifs is reduced to single consensus expressions, and the resulting seed patterns are used to search SWISS-PROT. Results are checked manually to determine how well the patterns have performed: ideally, there should be only correct matches (so-called true-positives), and no incorrect matches (false-positives).
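PROSITE-style patterns map naturally onto conventional regular expressions. The sketch below handles only the core syntax (x for any residue, [..] for allowed residues, {..} for disallowed residues, and (n) or (n,m) repeat counts); the example pattern is hypothetical, not a real PROSITE entry:

```python
# Sketch: translating a PROSITE-style pattern into a Python regex and using
# it to scan a sequence. Handles core syntax only; the pattern is invented.
import re

def prosite_to_regex(pattern):
    regex = ""
    for element in pattern.strip(".").split("-"):
        count = ""
        m = re.match(r"(.+?)\((\d+(?:,\d+)?)\)$", element)
        if m:                                  # repeat count, e.g. x(2)
            element, count = m.group(1), "{%s}" % m.group(2)
        if element == "x":
            regex += "." + count               # any residue
        elif element.startswith("{"):
            regex += "[^" + element[1:-1] + "]" + count   # disallowed set
        else:
            regex += element + count           # literal residue or [..] class
    return regex

pat = prosite_to_regex("C-x(2)-[DE]-{P}-C")
print(pat)                               # C.{2}[DE][^P]C
print(bool(re.search(pat, "ACWKDGC")))   # True: C, any two, D, not-P, C
```

Note that this is exactly the trade-off discussed above: a match is all-or-none, with no notion of a near-miss.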



From inspection of sequence alignments, it is clear that most protein families are characterised not by one, but by several conserved motifs. It therefore makes sense to use many, or all, of these to build diagnostic signatures of family membership. This is the principle behind the development of the PRINTS fingerprint database, which until 1999 was maintained in the Department of Biochemistry and Molecular Biology at University College London (UCL). Fingerprints inherently offer improved diagnostic reliability over single-motif methods by virtue of the mutual context provided by motif neighbours: in other words, even if a query sequence fails to match all the motifs in a given fingerprint, the pattern of matches formed by the remaining motifs still allows the user to make a reasonably confident diagnosis. Within PRINTS, motifs are encoded as ungapped, unweighted local alignments. The process used to derive fingerprints differs markedly from that used to create regular expressions. Here, sequence information in a set of seed motifs is augmented through a process of iterative (composite) database scanning. In brief, from a small initial multiple alignment, conserved motifs are identified and excised manually for database searching (PRINTS is currently derived from scans of OWL, but future releases will be built from searches of SWISS-PROT + SP-TrEMBL). Results are examined to determine which sequences have matched all the motifs within the fingerprint; if there are more matches than were in the initial alignment, the additional information from these new sequences is added to the motifs, and the database is searched again. This iterative process is repeated until no further complete fingerprint matches can be identified. The results are then annotated for inclusion in the database.
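The complete-versus-partial match logic at the heart of fingerprint diagnosis can be sketched as follows (a toy illustration: the motifs are invented strings matched by substring search, whereas real fingerprint scanning scores ungapped alignments position by position):

```python
# Sketch of fingerprint diagnosis: scan a query against every motif of a
# fingerprint; a partial match (some motifs hit, some missed) can still
# support a diagnosis. Motifs here are toy strings, not real PRINTS data.

def scan_fingerprint(query, motifs):
    hits = {name: (motif in query) for name, motif in motifs.items()}
    n = sum(hits.values())
    if n == len(motifs):
        status = "complete"
    elif n > 0:
        status = "partial"
    else:
        status = "no match"
    return hits, status

# Hypothetical three-motif fingerprint, in the spirit of OPSIN1-3:
fingerprint = {"MOTIF1": "GFPIN", "MOTIF2": "NLEGF", "MOTIF3": "PMSNF"}
hits, status = scan_fingerprint("AAGFPINXXNLEGFAA", fingerprint)
print(status, hits)   # two of three motifs hit: a partial match
```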
At the top of the file, each fingerprint is given an identifying code (usually an acronym that attempts to describe the family) and a title that gives the family name - the fingerprint, or signature, for the opsins is identified by the code OPSIN. Prior to the date line, which indicates when the entry was added to the database and when it was last updated, a number of database cross-links are provided, allowing users to access additional information about the family in related biological resources.

Where possible, the description includes details of the structural and/or functional relevance of the conserved motifs. The second section of the PRINTS entry contains information relating to the diagnostic performance both of the fingerprint as a whole and of its constituent motifs. First, a summary lists how many sequences matched all the motifs and how many made partial matches (i.e., failed to match one or more motifs). The table that follows provides additional information in support of these results, detailing how many sequences were matched by each individual motif - here, the important information gained is which of the motifs the reported partial hit failed to match.

In the final part of the entry, the seed motifs used to generate the fingerprint are listed, followed by the final motifs (not shown) that result from the iterative database scanning procedure. Each motif is identified by its parent ID code and a number that indicates which component of the fingerprint it is; the three motifs in the OPSIN fingerprint are designated OPSIN1, OPSIN2 and OPSIN3. After the code, the motif length is given, followed by a short description, which indicates the relevant iteration number (for the initial motifs, of course, this will always be 1). The aligned motifs are then provided, together with the source database ID code of each constituent sequence fragment (here, only sequences from SWISS-PROT were included in the initial alignment). The location of each fragment in its parent sequence is then given, together with the interval (i.e., the number of residues) between the fragment and its preceding neighbour - for the first motif, this value is the distance from the N-terminus.

Unlike regular expressions and other such abstractions, motifs stored in this 'raw' form lose no sequence information. An important consequence is that a variety of different scoring methods may be overlaid onto the motifs, providing different scoring potentials for different perspectives on the same underlying data. PRINTS may thus provide the raw material for automatically derived tertiary databases.
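As a small illustration of layering different scoring schemes onto the same raw motif (the aligned fragments below are invented, and both schemes are deliberately simple ones of our own choosing):

```python
# Sketch: two scoring schemes derived from the same raw ungapped motif -
# per-column residue frequencies, and per-column identity (consensus) scores.
# The aligned fragments are illustrative, not real PRINTS data.

from collections import Counter

motif = ["NLEGF", "NLEGY", "SLEGF", "NLEGF"]

def column_frequencies(aligned):
    """Residue counts for each column of the ungapped alignment."""
    return [Counter(col) for col in zip(*aligned)]

def identity_scores(aligned):
    """Fraction of sequences sharing the commonest residue in each column."""
    return [max(c.values()) / len(aligned) for c in column_frequencies(aligned)]

freqs = column_frequencies(motif)
print(freqs[0])                 # residue counts in the first column
print(identity_scores(motif))   # 1.0 marks perfectly conserved columns
```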

The database is accessible for keyword and sequence searching through the Bioinformatics Web server, which in 1999 relocated from UCL to the University of Manchester. The annotation provided by PROSITE and PRINTS sets them apart from other secondary databases, helping to place conserved sequence information in a structural or functional context - vital for the end user, who needs to understand its biological significance. The following sections briefly describe some related secondary and tertiary databases that are generated by more automated procedures and provide little or no family annotation; some of these use PRINTS and PROSITE as their data sources.



The analytic limitations of regular expressions led to the creation of a multiple-motif database, based on the protein families in PROSITE, at the Fred Hutchinson Cancer Research Center (FHCRC) in Seattle: the BLOCKS database. In this resource, blocks are created by automatically detecting the most highly conserved regions of each protein family, using a method based on the identification of three conserved amino acids (which need not be contiguous in sequence). The resulting blocks, which are ultimately encoded as ungapped local alignments, are calibrated against SWISS-PROT to obtain a measure of the likelihood of a chance match.

Two scores are noted for each block: the first denotes the level at which 99.5% of matches are true-negatives; the second is the median value of the true-positive scores, for the purpose of comparing the diagnostic performance of individual blocks. The median standardised score for known true-positive matches is termed strength.

The structure of the database entry is compatible with that used in PROSITE: each block is identified by a general code on the ID line and by an accession number, which takes the form BL00000X, where X is a letter that specifies the position of the block within the family's set of blocks (e.g., BL00327C is the third bacterial rhodopsin block).

Similarly, the ID line indicates the type of discriminator to expect in the file - here, not surprisingly, the word BLOCK tells us to expect a block. The AC line also provides an indication of the minimum and maximum distances of the block from its preceding neighbour, or from the N-terminus if it is the first in a set of blocks. A title, or description of the family, is contained in the DE line. This is followed by the BL line, which provides an indication of the diagnostic power and some physical details of the block: these include the amino acid triplet (here R-Y-A), the width of the block and the number of sequences it contains, the 99.5%-level score, and finally the strength. Strong blocks are more effective than weak blocks (strength less than 1100) at separating true-positives from true-negatives. This information is followed by the block itself, which indicates the SWISS-PROT IDs of the constituent sequences, the start position of each fragment, the sequence fragment itself, and a score, or weight, that provides a measure of the closeness of the relationship of that sequence to others in the block (100 being the most distant). Sequence fragments that are less than 80% similar are separated by blank lines. Because the database is derived by fully automatic methods, its blocks are not annotated, but links are made to the corresponding PROSITE family documentation file. The database is accessible for keyword and sequence searching using the Blocks Web server at the FHCRC.


In addition to the BLOCKS database, the FHCRC Web server provides a version of the PRINTS database in BLOCKS format. In this resource, the scoring methods that underlie the derivation of blocks have been applied to each of the aligned motifs in PRINTS. The structure of the entry is identical to that used in BLOCKS, with only minor differences. On the AC line, the PRINTS accession number is given, with an appended letter to indicate which component of the fingerprint it is. On the BL line, the triplet information is replaced by the word 'adapted', indicating that the motifs have been taken from another database.

Because the BLOCKS version of PRINTS is derived automatically from PRINTS, its blocks are not annotated. Nevertheless, family and motif documentation may be accessed through links to the corresponding PRINTS entries. The database is accessible for keyword and sequence searching via the Blocks Web server at the FHCRC. A further important consequence of the direct derivation of the BLOCKS databases from PROSITE and PRINTS is that they offer no additional family coverage of their own. Even so, ~50% of the families encoded in PRINTS are not represented in PROSITE, so searches of both BLOCKS databases will be more comprehensive than searches of either resource alone.


An alternative to the motif-based approach to protein family characterisation adopts the principle that the variable regions between conserved motifs also contain valuable sequence information. Here, the complete sequence alignment effectively becomes the discriminator. The discriminator, termed a profile, is weighted to indicate where insertions and deletions (indels) are allowed, what types of residue are allowed at which positions, and where the most conserved regions are. Profiles (also known as weight matrices) provide a sensitive means of detecting distant sequence relationships in which only very few residues are well conserved - in these circumstances, regular expressions cannot provide good discrimination, and will either miss too many true-positives or catch too many false-positives. The limitations of regular expressions in identifying distant homologues led to the creation of a compendium of profiles at the Swiss Institute for Experimental Cancer Research (ISREC) in Lausanne. Each profile has separate data and family-annotation files, whose formats are compatible with PROSITE data and documentation files; this allows results that have been annotated to an appropriate standard to be made available as an integral part of PROSITE.

The structure of PROSITE profile entries

The structure of the file is based on that of PROSITE, but with some notable differences. The first change is seen on the ID line, where the word MATRIX indicates that the type of discriminator to expect is a profile. Pattern (PA) lines are replaced by matrix (MA) lines, which list the various parameter specifications used to derive and describe the profile: these include details of the alphabet used (i.e., whether nucleic acid {ACGT} or amino acid {ABCDEFGHIKLMNPQRSTVWYZ}), the length of the profile, cut-off scores (which are designed, as far as possible, to exclude random matches), and so on. The I and M fields contain position-specific profile scores for insert and match positions respectively. Profiles that have not achieved the standard of annotation necessary for inclusion in PROSITE are nevertheless made available for searching via the ISREC Web server.


Just as there are different ways of using motifs to characterize protein families (e.g., depending on the scoring scheme used), so there are different methods of using full sequence alignments to build family discriminators. An alternative to the use of profiles is to encode alignments in the form of Hidden Markov Models (HMMs). These are statistically based mathematical treatments, consisting of linear chains of match, delete or insert states, that attempt to encode the sequence conservation within aligned families. A collection of HMMs for a range of protein domains is provided by the Pfam database, which is maintained at the Sanger Centre. The database is based on two distinct classes of alignment: hand-edited seed alignments, which are deemed to be accurate (these are used to produce Pfam-A); and those derived by automatic clustering of SWISS-PROT, which are less reliable (these give rise to Pfam-B). The high-quality seed alignments are used to build HMMs, to which sequences are automatically aligned to generate the final full alignments. If the initial alignments do not produce HMMs with good diagnostic power, the seed is improved and the gathering process is iterated until a good result is achieved. The methods that ultimately generate the best full alignment may vary for different families, so the parameters used are saved in order that the results can be reproduced. The collection of seed and full alignments, coupled with minimal annotations, database and literature cross-references, and the HMMs themselves, constitutes Pfam-A. All sequence domains not included in Pfam-A are automatically clustered and deposited in Pfam-B.
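A full profile HMM chains match, insert and delete states across every alignment column, which is too elaborate to show compactly. The sketch below shrinks the idea to a two-state toy model - one 'match-like' state that strongly prefers a conserved residue and one 'insert-like' background state - to illustrate the Viterbi machinery such models use to decode a sequence. All probabilities, and the restricted {A, G, L, S} alphabet, are invented for the example.

```python
import math

# Toy two-state HMM: "match" prefers G; "insert" emits residues uniformly.
states = ["match", "insert"]
start = {"match": 0.6, "insert": 0.4}
trans = {
    "match":  {"match": 0.8, "insert": 0.2},
    "insert": {"match": 0.3, "insert": 0.7},
}
emit = {
    "match":  {"G": 0.7, "A": 0.1, "L": 0.1, "S": 0.1},
    "insert": {"G": 0.25, "A": 0.25, "L": 0.25, "S": 0.25},
}

def viterbi(seq):
    """Return the most probable state path for seq (log-space dynamic programming)."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for res in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][res])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Running `viterbi("GLLL")` assigns the leading G to the match state and the trailing residues to the insert state, showing how the model separates conserved from variable regions.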

The format is compatible with PROSITE, each entry being identified by both an accession (AC) number (which takes the form PF00000) and an ID code (a single keyword). DE lines provide the title, or description, of the family, and AU lines indicate the author of the entry. The methods used to create the seed and the full automatic alignments are noted on the AL and AM lines respectively. The source database suggesting that the seed members belong to one family, appropriate database cross-references, and the search program and cut-off used to build the full alignment are given on the SE, DR and GA lines respectively. Although entries in Pfam-A have an annotation file available (which may contain details of the method, a description of the domain, and links to other databases), extensive family annotations are not yet in place.

Pfam is accessible for sequence searching via the Web server at the Sanger Centre on the Hinxton Genome Campus.



Another automatically derived tertiary resource, based on BLOCKS and PRINTS, is IDENTIFY, which is produced in the Department of Biochemistry at Stanford University. The program used to generate this resource, eMOTIF, is based on the generation of consensus expressions from conserved regions of sequence alignments. However, rather than encoding the exact information observed at each position in an alignment (or motif), eMOTIF adopts a 'fuzzy' approach in which alternative residues are tolerated according to a set of prescribed groupings. These groups correspond to various biochemical properties, such as charge and size, theoretically ensuring that the resulting motifs have sensible biochemical interpretations.

Although this technique is designed to be more flexible than exact regular expression matching, its inherent permissiveness brings with it an inevitable signal-to-noise trade-off: i.e., the resulting patterns not only have the potential to make more true-positive matches, but they will consequently also match more false-positives. However, when using the resource for sequence searching, different levels of stringency are offered from which to infer the significance of matches. IDENTIFY and its search software, eMOTIF, are accessible for use via the protein function Web server from the Biochemistry Department at Stanford.
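The grouping idea can be illustrated with ordinary regular expressions: each motif position is expanded into a character class covering a biochemical group rather than a single residue. The groups and the motif below are invented for illustration and are not eMOTIF's actual definitions.

```python
import re

# Hypothetical biochemical groupings; each motif position may name a
# group (expanded to a character class) or a literal residue.
groups = {
    "small":     "AGS",
    "aliphatic": "ILV",
    "aromatic":  "FWY",
    "basic":     "KRH",
    "acidic":    "DE",
}

def motif_to_regex(positions):
    """Build a compiled regex from a list of group names or literal residues."""
    parts = [
        "[%s]" % groups[p] if p in groups else p
        for p in positions
    ]
    return re.compile("".join(parts))

# A hypothetical fuzzy motif: small residue, exact G, basic, aromatic.
pattern = motif_to_regex(["small", "G", "basic", "aromatic"])
```

Against the sequence `"MAGKWDE"`, this pattern matches the window `"AGKW"` (A is small, G is exact, K is basic, W is aromatic), while sequences lacking any compatible window produce no match - exactly the permissive behaviour, and the attendant signal-to-noise trade-off, described above.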

While there is some overlap between them, the contents of the PROSITE, PRINTS, Profiles and Pfam databases are different. In 1998, together they encoded ~1500 protein families, covering a range of globular and membrane proteins, modular polypeptides, and so on. It has been estimated that the total number of protein families may lie in the range 1000 to 10000, so there is still a long way to go before any of the secondary databases can be considered complete. Thus, in building a search strategy, it is good practice to include all available secondary resources, to ensure both that the analysis is as comprehensive as possible and that it takes advantage of a variety of search methods.

Composite protein pattern databases

Secondary database searching should soon become more straightforward. The curators of PROSITE, Profiles, PRINTS and Pfam are now co-operating with a view to creating a unified database of protein families. The aim is to provide a single, central family annotation resource in Geneva (based on existing documentation in PROSITE and PRINTS), each entry of which will point to the corresponding discriminators in the parent PROSITE, Profiles, PRINTS or Pfam databases. This will simplify sequence analysis for the user, who will thereby have access to a one-stop shop for protein family analysis.

This effort is also supported by the curators of the BLOCKS databases, who, recognizing the difficulty of providing detailed family documentation, are developing a dedicated protein family Web site, termed ProWeb. This facility provides information about individual families through hyperlinks to existing Web resources that are maintained by researchers in their own fields. The curators of ProWeb see its primary utility as being similar to that of written reviews, but with the advantage that it can be readily updated. ProWeb will greatly facilitate the task of secondary database annotators, by providing convenient access to family information and avoiding the need for the annotators themselves to become 'expert' on all proteins.

Structure classification databases

A discussion of the repertoire of biological databases that may be used to aid sequence analysis would not be complete without consideration of protein structure classification resources. Of course, these are currently limited to the relatively few 3D structures available from crystallographic and spectroscopic studies, but their impact will steadily increase as more structures become available.

Many proteins share structural similarities, reflecting, in some cases, common evolutionary origins. The evolutionary process involves substitutions, insertions and deletions in amino acid sequences. For distantly related proteins, such changes can be extensive, yielding folds in which the numbers and orientations of secondary structures vary considerably.

However, the structural environments of critical active-site residues are nevertheless conserved. In an attempt to better understand sequence/structure relationships and the underlying phylogenetic processes that give rise to different fold families, a variety of structure classification schemes have been produced. The nature of the information presented by a structural classification scheme is entirely dependent on the underlying philosophy of the approach, and hence on the methods used to identify and assess structural similarity. Structural families derived, for example, using algorithms that search and cluster on the basis of common motifs will differ from those generated by procedures based on global structure comparison; and the results of such automatic procedures will differ again from those based on visual inspection, where software tools are used essentially to render the task of classification more manageable.

Two well-known classification schemes are outlined below.


The SCOP (Structural Classification of Proteins) database, maintained at the MRC Laboratory of Molecular Biology and Centre for Protein Engineering, describes structural and evolutionary relationships between proteins of known structure. Because current automatic structure comparison tools cannot reliably identify all such relationships, SCOP has been constructed using a combination of manual inspection and automated methods. The task is complicated by the fact that "protein structures show such variety, ranging from small, single domains to vast multi-domain assemblies". In some cases (e.g., some modular proteins), it may be appropriate to describe a protein structure both at the multi-domain level and at the level of its individual domains.

SCOP classification

Proteins are classified in a hierarchical fashion to reflect their structural and evolutionary relatedness. Within the hierarchy there are many levels, but principally these describe the family, superfamily and fold. The boundaries between these levels may be subjective, but the higher levels generally reflect the clearest structural similarities.

• Family. Proteins are clustered into families with clear phylogenetic relationships if they have sequence identities of 30% and above. But this is not an absolute measure - in some cases (e.g., the globins), it is possible to infer common descent from similar structures and functions in the absence of significant sequence identity (some members of the globin family share only 15% identity).

• Superfamily. Proteins are placed in superfamilies when, in spite of low sequence identity, their structural and functional characteristics suggest a common evolutionary origin.

• Fold. Proteins are classed as having a common fold if they have the same major secondary structures in the same arrangement and with the same topology, whether or not they have a common evolutionary origin. In such cases, the structural similarities could have arisen as a result of physical principles that favor particular packing arrangements and fold topologies. SCOP is accessible for keyword interrogation via the MRC Laboratory Web server.



The CATH (Class, Architecture, Topology, and Homology) database is a hierarchical domain classification of protein structures maintained at UCL (Orengo et al., 1997). The classification is largely derived using automatic methods, but manual inspection is necessary where automatic methods fail. Different categories within the classification are identified by means of both unique numbers (by analogy with the E.C. numbering system for enzymes) and descriptive names. Such a numbering scheme allows efficient computational handling of the data. There are five levels in the hierarchy:

• Class is derived from gross secondary structure content and packing. Four classes of domain are recognized: (i) mainly-α, (ii) mainly-β, (iii) α-β, which includes both alternating α/β and α+β structures, and (iv) those with low secondary structure content.

• Architecture describes the overall arrangement of secondary structures, ignoring their connectivity; it is assigned manually using simple descriptions of the secondary structure arrangements (e.g., barrel, roll, sandwich, etc.).

• Topology gives a description that encompasses both the overall shape and the connectivity of secondary structures. This is achieved by means of structure comparison algorithms that use empirically derived parameters to cluster the domains. Structures in which at least 60% of the larger protein matches the smaller are assigned to the same topology level.

• Homology groups domains that share ≥35% sequence identity and are thought to share a common ancestor. Similarities are identified first by sequence comparison and subsequently by means of a structure comparison algorithm.

• Sequence provides the final level within the hierarchy, whereby structures within homology groups are further clustered on the basis of sequence identity. At this level, domains have sequence identities >35% (with at least 60% of the larger domain equivalent to the smaller), indicating highly similar structures and functions.

CATH is accessible for keyword interrogation via UCL's Biomolecular Structure and Modelling Unit Web server.



A major resource for accessing structural information is PDBsum, a Web-based compendium maintained at UCL. PDBsum provides summaries and analyses of all structures in the PDB. Each summary gives an at-a-glance overview of the details of a PDB entry in terms of resolution and R-factor, numbers of protein chains, ligands, metal ions, secondary structure, fold cartoons, ligand interactions, etc. This is valuable not only for visualizing the structures held in PDB files, but also for drawing together in a single resource information at the 1D (sequence), 2D (motif) and 3D (structure) levels. Resources of this type will become increasingly important as visualization techniques improve, and new-generation software allows more direct interaction with their contents. PDBsum is accessible for keyword interrogation through UCL's Biomolecular Structure and Modelling Unit Web server.

Advances in computer technology will play an important role in simplifying the task of sequence analysis in the near future; developments such as CORBA, which facilitates distributed programming, and the object-oriented Internet programming language Java are poised to create a new generation of interactive tools that, for the first time, allow seamless integration of distant information systems at the desktop. Software that provides both 'intelligent' condensed views of the results and access to the raw search data will cater at the same time for the less experienced and for the expert user. In addition, interactive 1D, 2D and 3D visualization tools will offer new ways of interacting with raw computer outputs, helping to transform sequence, motif and structure information into biological knowledge.


Protein design

What is protein design?

Protein design aims to make new proteins, with new structures and functions, that have never existed in nature. Doing this requires comprehensive knowledge about proteins, and unifying such information requires computational (in silico) methods.

The use of computational techniques to create peptide- and protein-based therapeutics is a major challenge in medicine. The most direct goal, defined about two decades ago, is to use computer algorithms to identify amino acid sequences that not only adopt particular 3-D structures but also perform specific functions. To those familiar with the field of structural biology, this problem is known as "inverse protein folding". That is, while the grand challenge of protein folding is to understand how a particular protein, defined by its amino acid sequence, finds its unique 3-D structure, protein design involves the discovery of groups of amino acid sequences that fold into specific target structures and form functional proteins. Experimental, computational, and hybrid approaches have all contributed to advances in protein design. Applying mutagenesis and rational design techniques, for example, experimentalists have created enzymes with varied functionalities and increased stability. The coverage of sequence space is highly constrained for these techniques, however. An approach that samples more diverse sequences, called directed protein evolution, iteratively applies genetic recombination and in vitro functional assays. These methods, although they do a better job of sampling sequence space and generating functionally variable proteins, are still restricted to the screening of 10^3 to 10^6 sequences.

Computational methods play a range of roles in protein engineering, from the simple use of visualization to guide rational design to fully automated de novo design algorithms. Here we will focus on the former, that is, computational methods that complement human insight in rational protein engineering. The approaches can loosely be grouped into three classes: (1) methods based on analysis of primary (1°) sequence; (2) the visual analysis of protein structure; and (3) fast estimation of the effects of mutation. The mechanistic details of performing sequence and structural analysis have been discussed at length in other texts, and thus the focus here is on the usage of these approaches. The approaches discussed here all involve computational software that is available either as a Web service or as a freely downloadable program.

The complexity of the computational protein design problem is very large. Even a small protein of 100 residues admits 20^100 possible amino acid sequences; this collection of sequences, with each sequence represented by a single molecule, would occupy a volume larger than the universe. Additional complexity arises if one tries to model protein flexibility. It remains intractable to perform full-scale molecular dynamics simulations within a protein design calculation. Hence, most protein design studies consider only the mobility of protein side chains, while the protein backbone remains fixed.

Flexibility of amino-acid side chains is typically modeled by using a discrete set of statistically preferred conformations, called rotamers. With a larger number of rotamers used to represent each amino acid, the movement of side chains is modeled more precisely, but the design problem clearly becomes more complex. In protein design, the aim is to search over this large sequence space and to find the best (lowest-energy) sequence for a particular protein scaffold. The input to the protein design problem usually consists of a protein backbone structure, N sequence positions to be designed, the amino acids (and their respective rotamers) allowed at each position, and an energy function. The energy function, used to evaluate candidate protein sequences, is usually pairwise and thus consists of two primary components, corresponding to rotamer-template and rotamer-rotamer interactions. The template can include the fixed backbone atoms, residues not subject to subsequent optimization, and atoms within the rotamer itself (for which pseudo-energies are derived from rotamer library statistics).
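A minimal sketch of this pairwise energy model is given below, with invented energy values for a two-position toy problem: each rotamer carries a rotamer-template (self) energy, each pair of rotamers at different positions carries a rotamer-rotamer energy, and the total energy of an assignment is the sum of both kinds of term. For a problem this small, the best assignment can be found by brute-force enumeration.

```python
from itertools import combinations, product

# Hypothetical energies (kcal/mol-like, invented for illustration).
# E_self[i][r]: rotamer-template energy of rotamer r at design position i.
E_self = {
    0: {"ALA_1": -1.0, "LEU_1": -2.0},
    1: {"SER_1": -0.5, "PHE_1": -1.5},
}
# E_pair[(i, j)][(r, s)]: rotamer-rotamer energy between positions i < j.
E_pair = {
    (0, 1): {
        ("ALA_1", "SER_1"): -0.2, ("ALA_1", "PHE_1"): 0.1,
        ("LEU_1", "SER_1"): 0.4,  ("LEU_1", "PHE_1"): -1.1,
    },
}

def total_energy(assignment):
    """Energy of one rotamer per position: self terms plus all pair terms."""
    e = sum(E_self[i][r] for i, r in assignment.items())
    for i, j in combinations(sorted(assignment), 2):
        e += E_pair[(i, j)][(assignment[i], assignment[j])]
    return e

def brute_force_gmec():
    """Enumerate every assignment; feasible only for tiny problems."""
    positions = sorted(E_self)
    best = min(
        (dict(zip(positions, choice))
         for choice in product(*(E_self[i] for i in positions))),
        key=total_energy,
    )
    return best, total_energy(best)
```

Here the lowest-energy assignment is LEU at position 0 and PHE at position 1, because the favourable pair term outweighs the alternatives; real design problems make such enumeration impossible, which is what motivates the search algorithms discussed next.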

Numerous search algorithms have been developed to search the energy landscape for low-energy sequences and the preferred amino acids at each position. These algorithms fall into two classes: stochastic and deterministic. Stochastic algorithms follow probabilistic trajectories, where the resulting sequence depends on the initial conditions and a random number generator. Stochastic algorithms do not guarantee finding the GMEC (global minimum energy configuration) sequence, but they can always find an approximate solution. This may be sufficient, considering that simplifying assumptions in the energy function and in modeling protein flexibility inevitably introduce uncertainty in defining the best protein sequence. In contrast, deterministic algorithms always produce the same solution given the same parameters. Many, but not all, deterministic algorithms are guaranteed to find the GMEC sequence if they converge. However, convergence is not guaranteed, and the likelihood of convergence decreases with increasing problem size. In the following sections, we explain in detail several search algorithms that have been used in protein design studies, and mention some experimental studies in which they have been utilized.
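As a sketch of the stochastic class, the toy simulated-annealing search below flips one rotamer at a time and accepts moves by the Metropolis criterion under a decreasing temperature. The random pairwise energy table is invented purely for illustration, and recomputing the full energy at every move is wasteful but keeps the sketch simple.

```python
import math
import random

random.seed(0)
N_POS, N_ROT = 6, 4   # 6 design positions, 4 rotamers each (hypothetical)

# Random rotamer-template and rotamer-rotamer energy tables.
E_self = [[random.uniform(-2, 2) for _ in range(N_ROT)] for _ in range(N_POS)]
E_pair = {
    (i, j): [[random.uniform(-1, 1) for _ in range(N_ROT)]
             for _ in range(N_ROT)]
    for i in range(N_POS) for j in range(i + 1, N_POS)
}

def energy(rot):
    """Total energy of a rotamer assignment (one rotamer index per position)."""
    e = sum(E_self[i][rot[i]] for i in range(N_POS))
    for (i, j), table in E_pair.items():
        e += table[rot[i]][rot[j]]
    return e

def anneal(steps=20000, t0=2.0, t1=0.01):
    """Simulated annealing over rotamer assignments; returns the best found."""
    rot = [random.randrange(N_ROT) for _ in range(N_POS)]
    cur = energy(rot)
    best, best_e = rot[:], cur
    for k in range(steps):
        t = t0 * (t1 / t0) ** (k / steps)   # geometric cooling schedule
        i = random.randrange(N_POS)         # propose: change one rotamer
        old = rot[i]
        rot[i] = random.randrange(N_ROT)
        new = energy(rot)
        # Metropolis criterion: always accept downhill moves; accept
        # uphill moves with probability exp(-dE / t).
        if new <= cur or random.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_e:
                best, best_e = rot[:], cur
        else:
            rot[i] = old                    # reject: restore previous rotamer
    return best, best_e
```

Because the trajectory depends on the random number generator, two runs with different seeds may return different assignments - the behaviour that distinguishes stochastic from deterministic search, as described above.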


1) Name the three categories of protein databases.

2) Give examples of primary nucleic acid databases

3) Give examples of primary DNA databases

4) Who developed the first protein sequence database, and where?

5) What are the four forms of PIR? Explain each one's function


7) What are the two main sections of TrEMBL?

8) Explain briefly the NRL-3D database

9) List some problems in using NRDB

10) Give an example of a non-redundant protein database

11) Which protein sequence database amalgamates different protein sources?

12) Give two examples of secondary databases