Study of the human genome project


The Human Genome Project (HGP) was first proposed in 1985 as an effort to understand our shared molecular heritage and to gain the knowledge of the human organism needed for progress in medicine and the health sciences, such as the roots of disease and the genetic variants that increase the risk of common diseases (Barnhart, 1989; Sinsheimer, 1990). The project was officially initiated in 1990 with the objective of determining the DNA sequence of the entire euchromatic human genome within 15 years. It was run as an international effort known as the International Human Genome Sequencing Consortium (IHGSC), with more than 18 countries contributing (Lander et al., 2001). A working draft of the genome was released by the UCSC genome bioinformatics group in 2000, a complete draft was released in 2003, two years earlier than expected, and the sequence of the last chromosome was published in 2006, with further analysis still being published today. However, it is important to note that the project did not sequence all of the genetic material found in human cells; about 8% of the genome, mostly heterochromatic regions found in the centromeres and the telomeres, remains unsequenced owing to technological constraints.

Strategies and techniques used in the human genome project

Because of the enormity of the task and the uncertainty about what results would be obtained, the HGP set out to determine the human genome sequence in two phases: the shotgun phase and the finishing phase.

The shotgun phase

The sequencing of the human genome by the IHGSC was performed by a hierarchical shotgun method - or "clone by clone method" - with subsequent assembly of the sequenced segments. Shotgun sequencing (Anderson, 1981; figure 1) consists of breaking up the DNA randomly into numerous small segments, which are then sequenced using the chain termination method - also known as the Sanger method - to obtain reads.

In order to sequence the DNA by the shotgun technique, DNA clones first had to be obtained. These clones were derived from DNA libraries made by ligating DNA fragments from anonymous human donors into bacterial artificial chromosome (BAC) vectors. BACs are engineered vectors based on the bacterial F plasmid; once the human DNA has been inserted, they can be introduced into bacteria such as E. coli, where the target DNA is copied by the bacterial DNA replication machinery (O'Connor et al., 1989).

Then, the individual BAC clones selected for sequence analysis were further fragmented into pieces of various sizes, ranging from 2,000 to 300,000 base pairs, and the smaller DNA fragments were subcloned into vectors to create a BAC-derived shotgun library. These fragments were mapped to a particular region of a given chromosome before being selected for sequencing, hence the hierarchical nature of the process. The DNA segments were then sequenced using the chain termination method, which uses dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators. Finally, the multiple overlapping reads obtained by sequencing were assembled into a continuous sequence using assembly algorithms and powerful computers (Staden, 1979). This method has the advantage that all the contiguous sequence blocks, known as contigs, and the scaffolds derived from a BAC belong to a single compartment with respect to the genome.
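The assembly step described above, in which overlapping reads are merged into contigs, can be illustrated with a toy example. The sketch below is only a minimal greedy overlap merger written for this essay; it is not the software used by the IHGSC, the function names are hypothetical, and it assumes error-free reads from one strand with no read fully contained in another.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Assumption: reads are error-free and come from the same strand.
    Returns the contigs left when no overlap of at least `min_len` remains.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the pair, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
# -> ['ATGCGTACGTTAGC']
```

Real genomes contain repeats longer than a read, which is one reason the hierarchical approach first confines each assembly to a single BAC.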

At the same time, a private company, Celera Genomics, pursued the same goal using whole-genome shotgun sequencing with paired-end sequencing, also known as double-barrel shotgun sequencing. Whole-genome shotgun sequencing involves the random fragmentation of the entire human genome (figure 1). Each random DNA fragment was sequenced from both ends, and the resulting DNA sequences were assembled using computational methods and highly sophisticated algorithms that identify overlapping DNA sequences (Venter et al., 2001). This process allowed Celera to reconstruct the entire human genome while leaving out many of the early, time-consuming steps employed by the IHGSC. However, both groups used the same method for sequencing DNA: the Sanger method (figure 2). Ultimately, because Celera made use of the data previously published by the IHGSC, both groups finished sequencing the human genome at a similar time and two years ahead of schedule.

Figure 1. The hierarchical versus the whole-genome shotgun methods. The hierarchical shotgun method involves decomposing the genome into a series of overlapping BAC clones, sequencing and reassembling each BAC, and finally merging the sequences of adjacent clones. The whole-genome shotgun method involves performing shotgun sequencing on the entire genome and attempting to reassemble it all at once (from Waterston et al., 2002).

Figure 2. Chain termination sequencing. Both the IHGSC and Celera programs used the same technique to sequence their DNA libraries. The DNA template of interest was combined with DNA polymerase, a single-stranded DNA primer, free deoxynucleotide bases, and a mixture of fluorescently labeled dideoxynucleotide bases that terminate new DNA strand synthesis once incorporated into the end of a growing DNA strand. This process yields newly synthesized DNA strands of many different lengths. To determine the sequence, the DNA strands were electrophoresed through a gel matrix that allowed single-base differences in size to be easily distinguished. A laser was then passed across the gel to determine the colour of each terminal base and the intensity of its signal, thus delivering the DNA sequence (from Hood & Galas, 2003).
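The logic of chain termination can be sketched as a small simulation. This is only a toy model under stated assumptions: the template is written 3' to 5' so that the new strand reads 5' to 3' as its position-by-position complement, every position has the same chance of incorporating a ddNTP, and the function names and parameter values are hypothetical rather than taken from any laboratory protocol.

```python
import random

# Watson-Crick pairing used to build the new strand from the template.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_fragments(template, n_molecules=2000, ddntp_fraction=0.1, seed=0):
    """Simulate many polymerase extensions, each stopped by the first ddNTP.

    Returns one (fragment length, labeled terminal base) pair per terminated
    chain. `ddntp_fraction` is the assumed chance that a labeled terminator
    is incorporated at any given position.
    """
    rng = random.Random(seed)
    new_strand = "".join(COMPLEMENT[base] for base in template)
    fragments = []
    for _ in range(n_molecules):
        for length, base in enumerate(new_strand, start=1):
            if rng.random() < ddntp_fraction:
                fragments.append((length, base))
                break  # a ddNTP ends synthesis of this chain
    return fragments


def read_gel(fragments):
    """Order fragments by length, as electrophoresis does, and read the labels."""
    label_at_length = dict(fragments)  # same length always carries the same base
    return "".join(label_at_length[n] for n in sorted(label_at_length))


fragments = terminated_fragments("TACGGATCCA")
print(read_gel(fragments))
```

With enough molecules, every fragment length is represented, so reading the terminal labels from shortest to longest spells out the newly synthesized strand.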

The finishing phase

The finishing phase consisted of filling in the gaps and resolving the DNA sequences in ambiguous areas, such as those near centromeres and telomeres, that had not been obtained during the previous phase. This phase yielded 99% of the human genome in final form, comprising 2.85 billion nucleotides, with a predicted error rate of 1 event per 100,000 bases sequenced (IHGSC, 2004).
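To put the quoted error rate in perspective, the short calculation below works out how many errors it implies across the finished sequence. It is only an illustrative sketch: the function names are hypothetical, and the Phred-style quality score, Q = -10 log10(p), is a convention widely used in sequencing that the essay itself does not mention.

```python
import math


def expected_errors(genome_bases: int, bases_per_error: int) -> int:
    """Expected number of errors for a rate of 1 error per `bases_per_error` bases."""
    return genome_bases // bases_per_error


def phred_quality(bases_per_error: int) -> float:
    """Phred-style score Q = -10 * log10(p), with p = 1 / bases_per_error."""
    return 10 * math.log10(bases_per_error)


print(expected_errors(2_850_000_000, 100_000))  # -> 28500
print(round(phred_quality(100_000)))            # -> 50
```

In other words, even at 1 error per 100,000 bases, a 2.85 billion nucleotide sequence would be expected to contain on the order of 28,500 errors.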

Nonetheless, the finishing phase is still taking place today. The next step in the HGP is the complete annotation of the human genome, including the classification of the raw DNA sequence into well-defined gene structures, from which the encoded proteins and their possible functions can be predicted. However, classical sequencing technology is not adequate for these tasks, which require greater speed and lower costs. Consequently, the HGP now takes advantage of newly developed techniques that allow the genome to be sequenced more rapidly and at lower cost in order to obtain information from the human DNA sequence.

New techniques that could improve the HGP

New sequencing technologies capable of producing millions of sequences at once have now been developed, bringing down the cost and time of DNA sequencing. If the HGP were to begin today, researchers would probably use the whole-genome random shotgun sequencing method used by Celera. However, rather than using automated versions of the Sanger method for DNA sequencing, they would employ newer, much faster, automated DNA sequencing technologies. Three of the DNA sequencing methods used for whole-genome sequencing projects today are Illumina Solexa sequencing, pyrosequencing (also called 454 pyrosequencing or 454 sequencing) and microarray-based sequencing. Illumina and 454 sequencing are based on the principle of generating large numbers of unique polymerase-generated colonies, also known as "polonies", which can be sequenced simultaneously. The three methods are reviewed below.

Pyrosequencing

Pyrosequencing, a form of sequencing by synthesis, is based on detecting nucleotide additions by the DNA polymerase through light signals rather than through chain termination with dideoxynucleotides. This method uses the chemiluminescent enzyme luciferase, which emits light whenever a nucleotide is incorporated into the complementary strand being synthesized; because the four nucleotides are supplied one at a time, the light signals give the sequence of the template DNA as a series of peaks called a pyrogram (Metzker, 2005; figure 3).

An adaptation of this technique was licensed to 454 Life Sciences, in which DNA fragments are amplified on beads in the droplets of an emulsion. The template-carrying beads are then loaded into the wells of a fibre-optic slide, converting each well into a picoliter-scale reactor in which pyrosequencing takes place. The throughput, accuracy and robustness of this system were demonstrated by shotgun sequencing and de novo assembly of a bacterial genome (Margulies et al., 2005).

Figure 3. Pyrosequencing or sequencing by synthesis. Repeated cycles of nucleotide addition by polymerase (left) are detected by light emission (right). The identity of the nucleotides used is shown on the X axis. The signal measured at each cycle is shown on the Y axis, distinguishing multiple incorporation events (adapted from Adams, 2008).
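The way a pyrogram encodes a sequence can be shown with a small model. The sketch below is a simplified illustration, not instrument software: it assumes noise-free signals exactly proportional to the number of bases incorporated, and the flow order "TACG" and the function names are hypothetical choices made for the example.

```python
from itertools import cycle


def flowgram(new_strand: str, flow_order: str = "TACG") -> list[tuple[str, int]]:
    """Simulate idealized pyrosequencing flows for the strand being synthesized.

    Nucleotides are offered one at a time in `flow_order`. Each flow yields a
    signal equal to the number of consecutive bases incorporated (0 if none).
    """
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    for nucleotide in cycle(flow_order):
        if pos >= len(new_strand):
            break
        run = 0
        # A homopolymer run is incorporated in a single flow.
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            run += 1
            pos += 1
        flows.append((nucleotide, run))
    return flows


def decode(flows: list[tuple[str, int]]) -> str:
    """Read the sequence back from (nucleotide, signal) pairs."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = flowgram("TTAGC")
print(flows)          # -> [('T', 2), ('A', 1), ('C', 0), ('G', 1), ('T', 0), ('A', 0), ('C', 1)]
print(decode(flows))  # -> TTAGC
```

The signal of 2 in the first flow corresponds to the multiple incorporation events mentioned in the caption of figure 3.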

Solexa sequencing

This type of sequencing builds a DNA library by shearing the sample of interest to an average size of ~800bp using a compressed air device known as a nebulizer. The ends of the DNA are then polished, and two unique adapters are ligated to the fragments. Ligated fragments are then isolated via gel extraction and amplified using limited cycles of PCR in the channels of special flow cells (figure 4). At the end of the PCR, each channel contains several million copies of the sequence of interest.

This technique differs from the 454 pyrosequencing method in the way it obtains its polonies: whereas 454 pyrosequencing uses a bead-based emulsion PCR to generate them, Illumina employs a unique "bridged" amplification reaction that occurs on the surface of a flow cell, a chamber that resembles a water-tight microscope slide (Nyren, 2007).

Figure 4. The steps of Solexa sequencing. This technique generates several million dense clusters of double-stranded DNA fragments in each channel of the flow cell (step 9). The fluorescence emitted from the flow cell as the polymerase adds labeled nucleotides then determines the sequence of bases in a given fragment (adapted from

Microarray sequencing

At present, one of the most efficient ways of performing massively parallel sequencing is sequencing by hybridization on miniature devices known as microarrays (McKenzie et al., 1998). This method consists of immobilizing the DNA segment to be sequenced on a microarray, followed by its hybridization with a very large set of short, labeled probes. Finally, the pattern of hybridization is analyzed and the original DNA sequence is reconstructed. The technique can also be performed the other way around, by immobilizing thousands of short probes on a microarray and then hybridizing them with the DNA target, which has previously been labeled with a fluorescent dye (reviewed by Diamandis, 2000).
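How a hybridization pattern can be turned back into a sequence is sketched below. This is a minimal illustration under idealized assumptions: hybridization is taken to be perfect, so the set of matching probes is exactly the set of k-base words present in the target, and every (k-1)-base word occurs only once in the target. The function names are hypothetical.

```python
def spectrum(sequence: str, k: int) -> set[str]:
    """All k-base probes that would hybridize to `sequence` (idealized)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reconstruct(probes: set[str]) -> str:
    """Chain probes that overlap by k-1 bases into a single sequence.

    Raises ValueError when the probes do not determine one linear sequence,
    for example when the target contains repeats.
    """
    remaining = set(probes)
    suffixes = {p[1:] for p in remaining}
    # The first probe is the only one whose prefix is not another probe's suffix.
    starts = [p for p in remaining if p[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting probe")
    current = starts[0]
    sequence = current
    remaining.remove(current)
    while remaining:
        candidates = [p for p in remaining if p[:-1] == current[1:]]
        if len(candidates) != 1:
            raise ValueError("probes do not determine a unique sequence")
        current = candidates[0]
        sequence += current[-1]  # each new probe extends the sequence by one base
        remaining.remove(current)
    return sequence


probes = spectrum("ATGCGTCA", 3)
print(reconstruct(probes))  # -> ATGCGTCA
```

Repeated words in the target make the chain ambiguous, which is one practical limit on the length of sequence that short probes can resolve.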

Figure 5. DNA microarray. The DNA fragments to be sequenced are fluorescently labeled and hybridized to an array platform that contains known sequences. Strong hybridization signals then reveal the sequence of the DNA of interest (figure from Accessed 19/08/10)


The Human Genome Project marked a new approach in biomedical research, bringing the whole scientific community together to define a large body of biological knowledge that has changed research. However, the specific scientific plan and the feasibility of the project were unclear in 1990, and so the whole project was performed in phases. Nowadays, these phases can be dispensed with: faster and cheaper methods yield more accurate data, so that the desired information, i.e. the genes and their roles as well as any polymorphisms, can be obtained along with the chosen sequence.

Although the human genome was deciphered years ago, there are still many barriers between this code and its final understanding. One of these barriers is the cost of sequencing a human genome. The first human genome cost $3 billion to sequence. In recent years, the cost of sequencing a human genome has fallen below $10,000 for the first time (Metzker, 2010), giving researchers and pharmaceutical companies the potential to transform research costs and, eventually, therapeutic strategies. In addition to cheaper techniques, advances in computer technology have made the whole process cheaper, faster and more reliable. All these improvements lower the cost of gene technology, so that it can move from the lab to the doctor's office to detect disease and prevent genetic disorders, making the understanding of disease and the finding of therapies - the initial goal of the HGP - a reality for everyone. It is also important to note that the DNA sequence unveiled by the HGP is a combined "reference genome" obtained from five individual donors, and so it does not represent the exact sequence of any one individual. The technological advances that allow cheaper sequencing will therefore provide an alternative approach to the initial HGP, identifying variant DNA units in single individuals as sequencing takes place in order to relate them to an increased risk of disease.

In addition, the role of junk DNA, the evolution of the genome and individual differences - questions still being tackled by scientists all over the world - could also be clarified with faster and cheaper sequencing technologies.

Another important barrier for sequencing technologies to overcome is speed. Although the HGP obtained the first draft of the human genome two years ahead of schedule, the work still took more than ten years. If we want to extract information from entire genomes, researchers must be able to sequence at a much faster pace. One improvement in this area is that the IHGSC, together with a number of other organizations, has awarded a set of cooperative agreements to form the National Institutes of Health (NIH) BAC Resource Network, which aims to increase the number of available BAC libraries and thus the national BAC library-making capacity. The BAC Resource Network will produce at least fifteen BAC libraries at 10X coverage of 'mammalian-size' genomes or the equivalent (National Human Genome Research Institute, 2010). This supply of BACs will make genome sequencing much faster and the results more reliable.
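What "10X coverage" means for the size of a BAC library can be estimated with a short calculation. The figures used here are assumptions made for illustration, not values from the essay: a 3 billion base pair 'mammalian-size' genome and an average BAC insert of 150,000 base pairs. The uncovered fraction uses the standard Poisson approximation for randomly placed clones, exp(-coverage).

```python
import math


def clones_needed(genome_size_bp: int, insert_size_bp: int, coverage: float) -> int:
    """Number of clones whose inserts add up to `coverage` times the genome."""
    return math.ceil(coverage * genome_size_bp / insert_size_bp)


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of the genome present in no clone."""
    return math.exp(-coverage)


# Assumed values: 3 Gb genome, 150 kb average insert, 10X coverage.
print(clones_needed(3_000_000_000, 150_000, 10))  # -> 200000
print(expected_uncovered_fraction(10) < 1e-4)     # -> True
```

Under these assumptions, a single 10X library holds about 200,000 clones and leaves less than 0.01% of the genome unrepresented by chance.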

Finally, new technologies and increasing knowledge of the genomes of other vertebrates will make the cataloguing and characterization of the functional elements of the human genome easier and more reliable, as protein-coding regions can now be identified through comparison with other genomes. Thus, further characterization of other genomes is also crucial for the true completion of the HGP.

Figure 6. Deciphering the human genetic code will help us to understand disease, patterns of behavior and the evolution of human beings.