Integrating Genome Analysis And Data Repository Biology Essay


The properties that characterize a living organism are based on a fundamental set of genetic information known as its genome. A genome is composed of one or more DNA molecules, each organized as a chromosome. The DNA has all the information necessary for the functioning of a cell. The DNA sequence determines the protein sequence, which determines the protein structure, which in turn determines the protein function.

This chapter presents the biological background, such as genes, DNA, RNA and proteins and their interactions, necessary to understand the process of sequencing performed at the DNA Core Facility, University of Missouri, Columbia. The first section of this chapter describes the concept of genes and chromosomes from a high-level perspective. Subsequent sections describe the DNA structure and other biological concepts such as the transcription and translation of genes, proteins and protein folding, expressed sequence tags (EST) and genome sequencing.

2.1 Genes and Chromosomes

An organism's body is made up of cells, and each cell has a complete set of instructions about how to process the genetic information encoded in the genes of the organism. This set of instructions is called the genome. Every species has a different genome that contains all the biological information needed to build and maintain a living organism. This biological information is contained in the genome in the form of deoxyribonucleic acid (DNA) and is divided into discrete units called genes.

Genes are encoded in DNA molecules, which in turn are organized into chromosomes. The word chromosome literally means colored body; chromosomes are long strands of DNA that carry the genes. A gene is defined as a well-structured and localized region in the genome that encodes the information necessary for producing one or more proteins [6]. A gene includes not only the actual coding sequences but also the nucleotide sequences required for its proper expression. Genes instruct each cell type, such as skin, brain and liver, to make discrete sets of proteins at just the right times [11].

Genes make up about 1 percent of the total DNA in our genome and are made up of nucleic acid sequences which are transcribed into mRNA and then translated into proteins. In the human genome, the coding portions of a gene, called exons, are interrupted by intervening sequences, called introns. An intron is a DNA region in a gene that is not translated into protein, whereas exons are the sequences that actually code for proteins. There are three classes of genes. Protein-coding genes are templates for generating molecules called proteins. Each protein has a distinct purpose in the body of the organism. RNA-specifying genes are templates for producing RNA molecules that are themselves the final product, and they are distinct from the protein-producing genes. Finally, untranscribed genes are regions of genomic DNA that have some functional purpose but do not achieve that purpose by being transcribed or translated to create another molecule. An abstract view of the gene structure is shown in figure 2.1.

Figure 2.1: Gene Structure (adapted from [8])

A gene also forms the basis for heredity. In 1909, Danish botanist Wilhelm Johannsen coined the word gene for the hereditary unit found on a chromosome. Traits such as eye color, hair pattern and diseases are passed along the generations from parent to offspring. Most genetic variations occur when DNA is duplicated in the cell. The presence of an extra gene, a mutated gene or a missing gene gives rise to genetic diseases. Mutations are not always bad; they provide an opportunity for a species to adapt to a new environment. Disease inheritance is complicated and is not yet fully understood. Understanding every gene in the human genome is the goal of the current large-scale sequencing efforts, but this goal will take decades to achieve [8].

2.2 DNA structure

The quantitative relationships between adenine and thymine, and between cytosine and guanine, led Watson and Crick to propose a structural model for DNA in 1953. This model of DNA was based on several observations, such as the x-ray crystallography experiments, Chargaff's rules of base pairing, and experimental evidence that the bases were connected by hydrogen bonds [3]. DNA contains the information that is necessary to produce proteins, which govern all the functioning of the body of an organism. The information is in the form of a chemical (DNA) code.

Watson and Crick combined this disparate information to propose the double helix. The double helix of DNA, which has been studied in atomic detail using x-ray crystallography, is a structure in which adenine pairs with thymine and cytosine pairs with guanine by hydrogen bonding. The hydrogen-bonded base pairs form the core of the molecule.

The DNA (deoxyribonucleic acid) code is made up of very long chains of four basic building blocks: adenine (A), guanine (G), thymine (T) and cytosine (C). These building blocks of DNA are called bases, and each base has a slightly different composition containing a combination of oxygen, carbon, nitrogen and hydrogen. The precise order of the bases of the DNA determines the final product made from that gene. The bases pair up to form the rungs of a double helix ladder. Base A can only pair with base T, and vice versa; and base G can only pair with base C, and vice versa. Each rung in the double-helix structure is referred to as a base pair (bp) [8]. A DNA chain, called a strand, has a sense of direction, because one end (the 5' end) is chemically different from the other (the 3' end). A chromosome consists of two of these DNA chains running in opposite directions, so the same region can be read from the 5' end of one strand or from the 5' end of its partner.
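
The pairing rules and strand directions described above can be illustrated with a short Python sketch. The function names `complement` and `reverse_complement` are illustrative choices rather than part of any tool used in this chapter, and the sketch assumes the input contains only the four bases A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'-to-3' direction.

    The two strands run in opposite directions, so the partner is the
    complement read backwards.
    """
    return complement(strand)[::-1]


print(reverse_complement("ATGCGT"))  # prints ACGCAT
```

Taking the reverse complement twice returns the original strand, which is one quick way to check that the pairing table is symmetric.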

Figure 2.2 shows the structure of the partial double-stranded region of DNA.

Figure 2.2: DNA Double Helix (adapted from [11])

The specific pairing of bases of DNA suggests a mechanism by which each strand of DNA can serve as the template for synthesis of RNA. This evidence led Crick to propose his central dogma: that DNA directs its own replication and its transcription into RNA and that RNA is translated into proteins [3].

The main role of DNA molecules is storage of instructions needed to construct other components of cells, such as proteins and RNA molecules. The DNA segments that carry this genetic information are called genes. An analysis by scientists at Ohio State University suggests that humans have between 65,000 and 75,000 genes. The human genome contains 23 chromosomes, containing an estimated 3 billion base pairs.

2.3 Transcription and Translation of Genes

DNA not only acts as a template for making copies of itself but also acts as a blueprint for ribonucleic acid (RNA). A cell processes the DNA structure and transcribes it into RNA. RNA is structurally similar to DNA. The individual chemical units that make up RNA are A, C, G and U, where U stands for uracil (which takes the place of T in DNA).
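
As a minimal sketch of this base substitution, the Python function below rewrites a DNA sequence as the corresponding RNA sequence by replacing T with U. The function name `transcribe` is hypothetical, and the sketch assumes that the input is the coding (sense) strand, so that no complementing step is needed.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA sequence matching a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCT"))  # prints AUGGCU
```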

A cell processes the DNA and produces three main types of RNA: messenger RNA, transfer RNA and ribosomal RNA. Messenger RNA (mRNA) molecules are RNA transcripts of genes. They carry information from the genome to the ribosomes, which are the protein builders of the cell and help in the process of synthesizing proteins. Transfer RNA (tRNA) molecules are untranslated RNA molecules that transfer amino acids, which are the building blocks of proteins, to the ribosome. Finally, ribosomal RNA (rRNA) molecules are untranslated RNA components of ribosomes, which are complexes of protein and RNA. rRNAs are involved in anchoring the mRNA molecule and catalyzing some steps in the translation process [3].

When a gene is transcribed into RNA, the entire sequence, including the introns, is copied. This primary transcript of RNA is further processed to produce the protein coding mRNA. Translation of mRNA into protein is the final major step in putting the information in the genome to work in the cell.
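
The two processing steps described above, removing introns from the primary transcript and translating the resulting mRNA, can be sketched in Python as follows. This is an illustration under stated assumptions: the exon coordinates are supplied by hand, the example transcript is invented, and `CODON_TABLE` is a deliberately partial table that covers only the codons used in the example (a full table has 64 entries).

```python
# Partial codon table (assumption: only the codons needed for the example).
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCU": "A",  # alanine
    "UUU": "F",  # phenylalanine
    "AAA": "K",  # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def splice(primary_transcript: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon intervals (0-based, end-exclusive) and drop everything else."""
    return "".join(primary_transcript[start:end] for start, end in exons)


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon is reached.

    Raises KeyError for codons missing from the partial table above.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented primary transcript: exon, intron, exon.
primary = "AUGGCU" + "GUAAAG" + "UUUAAAUAA"
mrna = splice(primary, [(0, 6), (12, 21)])
print(mrna)             # prints AUGGCUUUUAAAUAA
print(translate(mrna))  # prints MAFK
```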

Errors in replication and transcription of DNA are relatively common, and if these errors occur in the reproductive cells of an organism then they can be passed on to its progeny. Alterations in the sequence of DNA caused by such errors are called mutations and can have harmful results in the progeny. If a mutation does not kill an organism before it reproduces, then that mutation can become fixed in the population over many generations [3]. The slow accumulation of these mutations is responsible for the process known as evolution. Darwin's theory of evolution by natural selection describes the observable process of evolution.

2.4 Proteins and protein-folding

Proteins are linear polymers built from small molecules called amino acids. Each of the 20 amino acids found in proteins has a different chemical nature. Proteins play a variety of roles in life processes. There are structural proteins, such as those found in the skin of an organism; proteins that catalyze chemical reactions; transport and storage proteins, such as hemoglobin; and the proteins of the immune system.

The chemical sequence of a protein is called its primary structure, but the way the sequence folds up is important to the functioning of the protein. The chain of amino acids folds into a characteristic shape in space, known as its folding pattern. Proteins show a variety of folding patterns. The amino acid sequence of a protein dictates its three-dimensional folded structure. When placed in a medium with suitable solvent and temperature conditions, like the one provided by the cell interior, proteins fold spontaneously to their native states. The correct three-dimensional structure of a protein is essential for its proper functioning. Sometimes a protein does not fold into the correct pattern. Several diseases are caused by the accumulation of such incorrectly folded proteins.

Proteomics is the cataloging and analysis of proteins to determine when a protein is expressed, how much of it is made and how it interacts with other proteins. It is the systematic analysis of the protein profiles of tissues. The word proteome refers to all the proteins produced by a species at a particular time. The goals of proteomics are: i) to identify every protein, ii) to determine the sequence of each protein and enter the data into databases and iii) to analyze protein levels globally in different cell types and at different stages of development [2].

2.5 Expressed Sequence Tags (EST)

An expressed sequence tag (EST) is a partial sequence of a randomly selected cDNA clone and is used to identify the genes expressed in a particular tissue. This partial sequence is selected from cloned DNA libraries. The majority of the sequences available in public data repositories are ESTs, which are partial rather than full-length DNA sequences.

Preparation for sequencing an EST begins with the laboratory procedure, which includes isolation of the mRNA, creation of the clones in the cDNA library and the other chemical processes necessary. Individual clones are picked from the library and one sequence is generated from each of the 5' and 3' ends. Since ESTs are short, between 200 and 500 bases in length, they represent only fragments of genes.

Once the sequence for an EST has been generated, the researchers need to know if the EST represents a new gene. The DNA data repositories are searched to find sequences that are similar. Normally BLAST is used for sequence similarity searches. If results show similarity to an existing sequence then the procedure for classifying the hit would give an indication of whether a novel gene has been found. But if the BLAST search shows no sequence similarity then it cannot be assumed that a new gene has been found. This is because the EST may represent a non-coding sequence for a gene that is not in the public data repositories.

2.6 Genome sequencing

Genome sequencing is a process that determines the entire DNA sequence of an organism. The organism's genetic relationships to its ancestors, its origins and its susceptibility to diseases can be investigated through genome sequencing. The most widely used technique for determining the sequence of DNA is known as Sanger sequencing, described in section 3.1, after its brilliant pioneer. This technique involves the separation of fluorescently labeled DNA fragments according to their length by polyacrylamide gel electrophoresis (PAGE). Bacteriophage φX174, a viral genome with only 5,386 base pairs (bp), was the first genome to be sequenced, using the popular shotgun sequencing method described in section 3.3.

The relative success of the initial genome-sequencing methods gave rise to many other genome-sequencing projects all over the world. The most ambitious of these was the Human Genome Project, whose purpose was to sequence the complete human genome. The Human Genome Project is described in section 3.4.

Genome-level sequencing produces the base pair sequence of an organism's entire genome. The purpose of sequence analysis is to analyze the DNA and RNA sequences of organisms that are the subject of investigations by biologists. Researchers at the University of Missouri, Columbia perform four main kinds of experiments. The first is to sequence genomic DNA so as to generate the raw DNA blueprint of the organism. The second is to look at the sequences of the messenger RNAs that are generated from genomic DNA and which correspond to the proteins in the body of the organism. The third kind is the ChIP-seq (chromatin immunoprecipitation sequencing) experiment, in which a subset of the genomic DNA is identified. The fourth kind is the small RNA experiment, which is used for identifying small RNAs; these are shorter than messenger RNA, at around 20 to 30 bases long. Small RNAs are important regulatory elements which bind to messenger RNAs and prevent them from being translated into proteins.

The cost of genome sequencing has been decreasing over the years. Whereas the Human Genome Project cost billions of dollars, advances in technology have since reduced the cost of sequencing a human genome to around ten million dollars [17]. Around 2003, many private organizations began investing their efforts in alternative sequencing technologies. The Human Genome Project provided a large amount of reference data for researchers, and technologies such as highly sensitive charge-coupled device (CCD) cameras and large storage devices helped to bring down the cost of sequencing projects. Recently Complete Genomics offered genome-sequencing services for only around five thousand dollars.

Chapter 3

Genome analysis

Genes are the nucleotide sequences of chromosomal DNA that code for proteins and structural RNAs. Along with coding sequences, the genome of an organism contains non-coding sequence information. The proportion of non-coding sequence varies among organisms, ranging from a few percent in viruses and bacteria to as much as 98% in humans [12]. Whole genomes are of interest not only because they contain all the hereditary genetic components of an organism, but also because they reflect the complexity of an organism. With the genome projects of many organisms being completed one by one, efforts are being made to understand genomes from all domains of life.

It has been observed that genomes from related organisms are conserved, including the relative position of genes, the amount of non-coding sequence and the chromosomes. Genes are not randomly distributed on the genome, and their location is associated with the size and organization of the chromosome. Several data repositories provide information about the nucleotide sequences and gene organization of various organisms. These data repositories are publicly accessible and are described in detail in chapter 4.

This chapter describes the Genome projects in general, the Shotgun approach of gene sequencing and the Human Genome project in particular.

3.1 The Genome Projects

Of all the genomes that have been sequenced so far, 95% are viral genomes, bacterial chromosomes or plasmids. Most are very small genomes ranging from several thousand to hundreds of thousands of base pairs. NCBI's Entrez Genome lists several organisms of interest, including the honey bee, cat, chicken, chimpanzee, cow, dog, frog, mouse, rat, pig and sheep. Several agriculturally important crop plants have also been sequenced, including tomato, soybean, barley, wheat, corn, oat, cotton, potato and rice. The University of Missouri-Columbia performs sequencing research on bovine (cow), ovine (sheep) and porcine (pig) genomes and on agricultural plants such as soybean and maize root. Researchers at the University of Missouri-Columbia working on the porcine genome have been able to identify the gene responsible for the early death of newborn piglets. This research can be used to help farmers identify whether the animal will survive while it is still in the womb. The research on the maize root genome may likewise help to identify techniques for growing plants in the absence of sufficient water.

As new genome projects are started and old ones completed, the progress of these sequencing projects can be monitored on the Sanger web site. The sequencing experiments that are completed are annotated and submitted to GenBank, EMBL and DDBJ. Genomes are included in these data repositories only if the complete sequence and taxonomic information of the organism is available. The central goal of genome sequencing is to build a complete collection of all the genes of known organisms. At NCBI, the Reference Sequence (RefSeq) collection provides a comprehensive, integrated, nonredundant set of sequences, including genomic DNA sequences, transcript (RNA) sequences and protein products [12]. RefSeq serves as the basis for medical and functional studies, as well as studies in evolution and systems biology.

Genome sequencing projects perform various experiments to determine the complete genomic DNA sequence of an organism. One approach to genome sequencing is to break up the genome into random, overlapping fragments, then to sequence the fragments and assemble the sequences using various computer algorithms [2]. Nowadays, most of the sequencing experiments are automated. Sanger's chain-terminator method is the most commonly used sequencing procedure in modern laboratories. Each of the four reaction mixtures is labeled with a different fluorescent tag, which allows the base to be identified by the scanner. The gel is scanned with a laser, which excites each fluorescent band in turn. The emitted signal is passed through four colored filters and then detected by the scanner. Bases are assigned by basecalling software and, based on the strength of the fluorescence signal, the sequence of the strand complementary to the original DNA is generated.

3.2 Analysis of Raw Sequence Data: Basecalling

The automated sequencing machines analyze the raw sequence data using a process called basecalling. The sequencing machine takes photographs of the bases highlighted by the fluorescent material. This raw data is processed and a sequence is assigned. The end user of the genome sequence data can perform basecalling operations on the images as desired, but often has to rely on the sequence that has been assigned by the basecalling software. Due to the physical limitations of the sequencing machine and the rate of the chemical reactions, the basecalling process can introduce some errors into the sequence data. Any sequence in GenBank is likely to have at least one error. Because sequencing projects must deal with these inherent errors, the accuracy of sequencing can be improved by sequencing each region of a genome multiple times.
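
One simple way to see why repeated sequencing improves accuracy is a majority vote over several reads of the same region. The Python sketch below is only an illustration: it assumes the reads have already been aligned to one another and have equal length, and the function name `consensus` is hypothetical rather than taken from any basecalling package.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Return the most common base at each position of equal-length, pre-aligned reads."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("reads must be non-empty and of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0]  # majority base in this column
        for column in zip(*aligned_reads)
    )


# Three reads of the same region; the third contains one basecalling error.
reads = ["ATGCA", "ATGCA", "ATCCA"]
print(consensus(reads))  # prints ATGCA
```

With three reads, a single erroneous call at a position is outvoted by the two correct ones; real pipelines also weight each call by its quality score.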

There are a variety of commercial and non-commercial tools for automated basecalling. Most sequencing machines are equipped with proprietary basecalling software that is customized for that particular sequencing hardware. One of the popular non-commercial basecalling programs is Phred, which is available from the University of Washington Genome Center. Phred uses Fourier analysis to resolve the fluorescence traces and predict an evenly spaced set of peak locations. It then annotates each base call with the probability that the call is an error; higher Phred scores mean a lower probability of error. Researchers use these Phred scores to determine whether a region of the genome needs to be resequenced.
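
The relationship between a Phred score and the error probability is commonly given as Q = -10 log10(P), so that a score of 20 corresponds to a 1 in 100 chance of error and a score of 30 to 1 in 1000. The Python sketch below applies that relationship; the helper `positions_to_resequence` and its default threshold of 20 are assumptions made for illustration, not settings taken from Phred itself.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert an error probability into a Phred-style quality score."""
    return -10 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred-style quality score back into an error probability."""
    return 10 ** (-quality / 10)


def positions_to_resequence(qualities: list[int], threshold: int = 20) -> list[int]:
    """Return the 0-based positions whose score falls below the (assumed) threshold."""
    return [i for i, q in enumerate(qualities) if q < threshold]


print(positions_to_resequence([30, 12, 40, 19]))  # prints [1, 3]
```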

3.3 Shotgun approach of gene sequencing

Shotgun DNA sequencing is the approach in which either the whole genome or a defined subset of the genome is broken into random fragments. These fragments are of manageable length and are cloned into plasmids. Plasmids are simple biological vectors that can incorporate a piece of DNA and reproduce it to provide sufficient material for sequencing.

Although only 400-500 bases of each fragment are sequenced, the total amount of sequenced DNA spans every base pair of the genome several times, since many clones are sequenced. DNA sequencing using a shotgun approach provides thousands or millions of short sequences, each 400-500 bases in length. These fragments are random and can overlap each other. Once the short sequences are obtained, they must be tiled together into one continuous sequence using a process called sequence assembly.
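
The number of times the genome is spanned, usually called the coverage, is the total number of sequenced bases divided by the genome length. The Python sketch below computes it, together with a rough estimate of the fraction of the genome left unsequenced. The estimate assumes that reads start at uniformly random positions (an idealized Poisson model), and the read counts in the example are invented.

```python
import math


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return num_reads * read_length / genome_length


def expected_uncovered_fraction(cov: float) -> float:
    """Fraction of bases expected to be missed, under the idealized Poisson assumption."""
    return math.exp(-cov)


# Invented example: 100,000 reads of 500 bases over a 5,000,000 bp genome.
c = coverage(100_000, 500, 5_000_000)
print(c)  # prints 10.0
print(expected_uncovered_fraction(c))
```

Even at high average coverage the expected uncovered fraction is small but not zero, which is one reason gaps remain after assembly.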

The primary task of the sequence assembly program is to identify sequence overlaps between fragments. Some fragments fail the cloning process and do not produce any sequence, which leaves gaps in the DNA sequence. These gaps complicate the sequence assembly process. The Phrap program, from the University of Washington Genome Center, can be used for assembling sequence fragments. TIGR Assembler is another sequence assembly program that works well for small genomes.
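
The idea of finding overlaps and tiling fragments together can be sketched with a toy greedy assembler in Python. This is not the algorithm used by Phrap or the TIGR program: it assumes error-free reads from a single strand, the fragments are invented, and the function names and the `min_overlap` default are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest overlap.

    Returns the remaining contigs; more than one contig means that gaps
    (regions with no overlapping fragments) prevented a full assembly.
    """
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))  # prints ['ATGCGTACGTTAGC']
print(greedy_assemble(["AAAA", "CCCC"]))                    # prints ['AAAA', 'CCCC']
```

The second call shows the effect of a gap: with no overlapping fragment to bridge them, the two pieces stay as separate contigs.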

3.4 Human Genome Project

Robert Sinsheimer, a molecular biologist, made the first proposal for the Human Genome Project (HGP) in 1985, while he was the chancellor of the University of California, Santa Cruz. The Department of Energy (DOE) and the National Institutes of Health (NIH) supported the HGP and were involved in the advisory board. The goal of the Human Genome Project was to identify all of the approximately 30,000 genes in human DNA and to determine the sequence of its 3 billion nucleotide base pairs. All the data and information generated by the HGP were stored in data repositories which are publicly accessible. Another goal of the HGP was to develop tools for analyzing the genomic data. Various tools and technologies were developed to perform tasks such as shotgun sequencing, genome assembly, BLAST searches and clustering of sequences.

Initially it was projected that the HGP would be completed by the year 2005, but due to rapid technological development a draft sequence was published in 2001 and the project was declared complete in 2003. A major milestone in the HGP came on June 26, 2000, when the members of the HGP announced the working draft of the human genome [12]. The DNA sequence of the human genome was made publicly accessible from the National Center for Biotechnology Information (NCBI) on February 15, 2001.

The human genome has provided important insight into the distribution and relative size of structural genes and non-coding regions [12]. About 95% of the human genome consists of non-coding regions. Of this, about a fourth is made up of the non-coding regions of genes, such as introns and regulatory elements. The rest of the genome consists of nondescript sequences and repeat sequences. The HGP identified 25,000 genes, of which 19,000 were identified on the basis of corroborative evidence, while 6,000 were identified computationally only [12]. The human genome contains 3.2 billion nucleotide bases (A, C, T and G). The sizes of genes vary greatly: the average gene consists of 3,000 bases, whereas the largest known human gene, dystrophin, is 2.4 million bases long. The function of more than 50% of the discovered genes is still unknown. About 2% of the genome encodes instructions for the synthesis of proteins. About 40% of the human proteins showed similarity with fruit-fly or worm proteins. The HGP also identified candidate genes for numerous diseases and disorders including breast cancer, muscle diseases, deafness and blindness.

Since all the data generated by the HGP is publicly accessible, it will be beneficial to scientists in many ways. The findings of various genome research programs will be beneficial in molecular medicine, both for developing better disease diagnosis and for designing drugs based on molecular information. Many other applications can be developed, such as producing new biofuels, combating biological warfare, cleaning up toxic waste, and identifying paternity and other family relationships.

Now that the Human Genome Project has been completed and data pertaining to all the human genes is available, the major challenge faced by the scientific community is to understand the content of the human genome. Deciphering the parts of the DNA that code for certain functions and identifying the regions of DNA that are responsible for diseases will help the medical community at large. Further genome-based research will eventually enable medical science to develop highly effective diagnostic tools, to better understand the health needs of people based on their individual genetic make-ups, and to design new and highly effective treatments for disease [13].

Chapter 4

Biological databases

The exponential growth of biological data presents the problem of managing this data in a manner that allows researchers easy access and the ability to deposit and retrieve sequences. Since researchers all over the world are performing sequencing experiments and generating genomic data, they need a way to collect, logically arrange and preserve this biological information in a meaningful way. The database technology developed in computer science can be used to achieve these goals, as databases minimize data redundancy and achieve data independence.

Different biological data repositories were created and data was gathered all over the world. The databases created out of these data repositories can be searched or cross-referenced either over the Internet or using downloaded versions on local computers. Many organizations create their own data repositories, called mirror sites, so that local resources can be utilized, resulting in faster access and retrieval. These mirror sites need to be kept updated with the primary data repositories.

Developing data repositories for storing biological data does not help researchers until relevant programs and tools are created that allow them to perform various operations on the data. Researchers perform tasks such as retrieving sequences from the database that are similar to a given sequence, finding the protein structure for a protein sequence, and finding protein structures in the database that adopt a 3D structure similar to that of a given protein sequence. The data repositories provide a wide variety of tools for information retrieval and analysis operations, such as retrieval of sequences from the database, sequence comparison, translation of DNA sequences to protein sequences, pattern recognition, and different molecular graphics and visualization programs.

Data repositories are classified into generalized databases and specialized databases. Generalized databases are DNA, protein, carbohydrate or similar databases, whereas specialized databases are expressed sequence tag (EST), genome survey sequence (GSS), single nucleotide polymorphism (SNP), sequence tagged site (STS) or similar databases. Generalized databases are further classified as sequence databases and structure databases. Sequence databases contain the individual sequence records of nucleotides, amino acids or proteins, whereas structure databases contain the individual records of biochemically solved structures of macromolecules, such as 3D protein structures [2].

Many databases and software applications that work with sequence data require a standard format that represents nucleic acid and protein sequence information. Some of the sequence formats are the EMBL format, FASTA format, GCG format, GenBank format and IG format [14]. A record in the data repository contains three sections. The header section includes the description of the sequence, its organism of origin, allied literature references and cross links to related sequences in other databases. The other two sections are the feature table, which contains a description of the features in the record such as coding sequences, exons, repeats and promoters, and the sequence itself, which is the part most easily analyzed by the computer.

There are three premier institutes in the world which constitute the International Nucleotide Sequence Database Collaboration. These are the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ). Data are stored in these data repositories and are exchanged daily to keep all of them in a consistent state. There are many other biological databases on the Internet; some of them are described below.

4.1 NCBI GenBank

The National Center for Biotechnology Information (NCBI) was established in 1988 on the campus of the National Institutes of Health, Maryland. The role of NCBI is to develop new technologies to understand the molecular and genetic processes that underlie health and disease [2].

NCBI GenBank maintains sequence data from every organism that has been sequenced so far. It contains sequence data ranging from mRNA and cDNA clones to expressed sequence tags (EST) and high-throughput genome sequencing data. NCBI GenBank incorporates sequences from publicly available sources, primarily from direct author submissions and large-scale sequencing projects [2]. There are two ways to search GenBank: the first is to use a text-based query to search the annotations associated with each DNA sequence, and the second is to use a sequence similarity method called BLAST. Each record in GenBank format starts with a line containing the word LOCUS, followed by a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//"). The main purpose of GenBank is to encourage the scientific community to deposit sequences and to make them freely available to researchers all over the world.
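
The record layout described above (a LOCUS line, annotation lines, an ORIGIN marker and a closing "//") is regular enough to read with a few lines of Python. The sketch below is a simplified illustration rather than a full GenBank parser: the function name `parse_genbank_record` is hypothetical, the sample record is invented, and features such as multi-record files are ignored.

```python
def parse_genbank_record(text: str) -> dict:
    """Split one GenBank-style record into locus name, annotation lines and sequence."""
    locus = None
    annotations = []
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_sequence:
            # Sequence lines carry a position number and spaced blocks of bases;
            # keep only the letters.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            locus = line.split()[1]      # second field is the locus name
        elif line.startswith("ORIGIN"):  # the sequence starts on the next line
            in_sequence = True
        else:
            annotations.append(line)
    return {
        "locus": locus,
        "annotations": annotations,
        "sequence": "".join(sequence_parts).upper(),
    }


# Invented record, for illustration only.
SAMPLE_RECORD = "\n".join([
    "LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000",
    "DEFINITION  Hypothetical record for illustration only.",
    "ORIGIN",
    "        1 atgcgtacgt tagcatgcgt acgt",
    "//",
])

record = parse_genbank_record(SAMPLE_RECORD)
print(record["locus"])     # prints EXAMPLE01
print(record["sequence"])  # prints ATGCGTACGTTAGCATGCGTACGT
```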

4.2 EMBL

The European Molecular Biology Laboratory (EMBL) nucleotide sequence database is maintained at the European Bioinformatics Institute in Hinxton, England. EMBL also contains sequences from direct author submissions, large-scale genome sequencing projects and patent applications. The database is produced in collaboration with DDBJ and GenBank. The EMBL database communicates with NCBI GenBank and DDBJ and is constantly updated. The rate of growth of the DNA databases has been exponential, with the number of sequences doubling every 9-12 months.
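
To give a feel for what such a doubling time implies, the short Python sketch below computes the growth factor over a given period as 2 raised to the number of elapsed doubling times. The function name `growth_factor` and the five-year horizon are illustrative assumptions.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Factor by which a quantity grows when it doubles every `doubling_months`."""
    return 2 ** (elapsed_months / doubling_months)


# Growth over five years (60 months) at the two ends of the quoted range.
print(growth_factor(60, 12))  # prints 32.0
print(growth_factor(60, 9))   # roughly a hundredfold
```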

4.3 DDBJ

The DDBJ database is maintained at the National Institute of Genetics, Japan and is the sole nucleotide database in Asia. Sequences can be submitted through the web interface. The DDBJ database is kept synchronized with GenBank and EMBL through daily updates, hence the three data banks share virtually the same data at any given time. The database includes the nucleotide sequence together with information about the submitters, references, source organisms, and the biological nature of the sequence, such as gene function and other properties.

4.4 Ensembl

Ensembl is intended to be the universal information source for the human genome. The goals of Ensembl are to collect, annotate and distribute information about the human DNA sequence to the scientific community. Ensembl is a joint project of the European Bioinformatics Institute and the Sanger Center. Data collected at Ensembl include the genes, SNPs, repeats and homologies pertaining to the human genome.

4.5 PIR Databases

The Protein Sequence Database was developed at the National Biomedical Research Foundation (NBRF) at Georgetown University in the early 1960s. The purpose of the Protein Sequence Database was to investigate evolutionary relationships between proteins. The Protein Information Resource (PIR), maintained by NBRF, incorporates several databases about proteins. The PIR database is split into four distinct sections, namely PIR1, PIR2, PIR3 and PIR4. They differ in terms of the quality of data and the level of annotation provided. PIR1 includes fully classified and annotated entries; PIR2 contains preliminary entries which have not been fully reviewed and may contain redundancy; PIR3 contains unverified entries, which have not been reviewed; and PIR4 entries fall into the category of conceptual translations.


4.6 SWISS-PROT

The Swiss Institute of Bioinformatics (SIB) collaborates with the EMBL Data Library to provide an annotated database of amino acid sequences. SWISS-PROT is a curated protein sequence database which provides high-level annotation and descriptions of the function of each protein and the structure of its domains. Entries in SWISS-PROT start with an identification line and finish with a // terminator. SWISS-PROT is interlinked with many other databases. The structure of the database and the quality of its annotation have made SWISS-PROT the database of choice for most research purposes [2].

4.7 PubMed

PubMed is maintained by the National Library of Medicine and includes the bibliographic database MEDLINE as well as links to other scientific articles maintained by journal publishers. Bibliographic databases contain published articles, abstracts and scientific peer-reviewed papers. MEDLINE is the National Library of Medicine's (NLM) premier bibliographic database and contains references to journal articles in the life sciences with a concentration on biomedicine. A distinctive feature of MEDLINE is that the records are indexed with NLM's Medical Subject Headings (MeSH). The database contains citations from 1950 to the present, with some older material [15]. PubMed is extensively used by researchers as it provides up-to-date information from different scientific publications. Some of the features of PubMed are retrieval of articles that are similar to a given article, links to full-text articles on participating publishers' web sites, and filters for searching clinical studies and systematic reviews.

4.8 Entrez

The Entrez facility, developed at NCBI, allows retrieval of molecular biology data and bibliographic citations from NCBI's integrated databases. Entrez provides access to DNA sequence data from GenBank, EMBL and DDBJ, protein sequence data from SWISS-PROT and PIR, genome and chromosome mapping data, 3D protein structures from PDB and the PubMed bibliographic database.