Using DNA to Solve Computational Problems


Computational problems, as discussed in this essay, are a collection of questions that computers might be asked to solve. The most notable characteristic of these problems is that no fast solution to them is known. As a consequence, determining whether or not it is possible to solve them quickly is one of the principal unsolved problems in computer science today.

Computer scientists rank computational problems in different complexity classes. The class of interest for this paper is called NP-complete (NPC). Problems within this class share two common properties:

Any given solution to the problem can be verified quickly (in polynomial time); the set of all decision problems with this property is called NP (nondeterministic polynomial time).

A problem p in NP is also in NPC if and only if every other problem in NP can be transformed into p quickly (in polynomial time); consequently, if any single NP-complete problem can be solved quickly, then every problem in NP can also be solved quickly.

As an example of an NPC problem, the Hamiltonian path problem will be discussed below:

In mathematics, a graph is an abstract representation of a set of objects (vertices) interconnected by links (edges). Typically, a graph is depicted as a set of dots for the vertices, joined by lines or curves for the edges.

The edges may be directed (asymmetric) or undirected (symmetric). For example, if the vertices represent people, and there is an edge between two people if they shake hands, then this is an undirected graph, because if person A shook hands with person B, then person B also shook hands with person A.

On the other hand, if there is an edge from person A to person B when person A knows person B, then this graph is directed, because knowing someone is an asymmetric relation (that is, one person knowing another person does not necessarily imply the reverse).

A Hamiltonian path is a path in a graph which visits each vertex exactly once (Fig. 1). Determining whether such paths exist in a given graph is the Hamiltonian path problem.

Fig. 1 Hamiltonian path
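The "quick verification" property that places the problem in NP can be illustrated with a short sketch. The graph representation and names below are illustrative, not from the original:

```python
# Verifying a proposed Hamiltonian path takes polynomial time, which is
# what places the problem in NP; only *finding* such a path is hard.
# Minimal sketch: a directed graph given as a set of edge tuples.

def is_hamiltonian_path(vertices, edges, path):
    """Check that `path` visits every vertex exactly once via existing edges."""
    if sorted(path) != sorted(vertices):
        return False          # must contain each vertex exactly once
    # every consecutive pair must be joined by a directed edge
    return all((u, v) in edges for u, v in zip(path, path[1:]))

V = ["a", "b", "c", "d"]
E = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")}
print(is_hamiltonian_path(V, E, ["a", "b", "c", "d"]))  # True
print(is_hamiltonian_path(V, E, ["a", "c", "d", "b"]))  # False: no edge d->b
```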

At present, all known computer algorithms for NP-complete problems require an unreasonable amount of time, which increases very quickly as the size of the problem grows. As a result, the time required to solve even moderately large versions of many of these problems easily reaches into the billions or trillions of years, using any amount of computing power available today. Methods for computing solutions using a reasonable amount of time remain undiscovered.

In 1994, however, Adleman solved an instance of the Hamiltonian path problem by using the tools of molecular biology instead of a computer, thus demonstrating the potential power of parallel, high-density computation by DNA molecules in solution. How DNA can be used to solve computational problems is discussed in the next section.

2. DNA Model of Computation

2.1 Structure of DNA

DNA (deoxyribonucleic acid) is a molecule, present in the cells of all known living organisms and some viruses, that contains the genetic code to construct other components of cells, such as proteins and RNA molecules.

DNA consists of two long polymers, entwined like vines in the shape of a double helix. Each polymer is made from repeating units called nucleotides. In a double helix the direction of the nucleotides in one strand is opposite to their direction in the other strand: the strands are antiparallel. The asymmetric ends of DNA strands are called the 5′ and 3′ ends.

The nucleotide repeats contain both a backbone segment (made of sugars and phosphate groups) and a base, which interacts with the other DNA strand in the helix. There are four types of bases found in DNA: adenine (A), cytosine (C), guanine (G) and thymine (T) (Fig.2).

Two nucleotides on opposite complementary DNA strands are connected via hydrogen bonds between bases. In DNA, adenine (A) always forms a base pair with thymine (T), as does guanine (G) with cytosine (C) (Fig.2), also known as Watson-Crick base pairing.

It is the sequence of these base pairs along the backbone that encodes information. The code is read by copying stretches of DNA into the related nucleic acid RNA (transcription). The resulting RNA molecule specifies the sequence of the amino acids within proteins, during a process known as translation.

Fig.2 DNA structure.

2.2 DNA Model of Computation

As introduced by Lipton in [12], the DNA model of computation is just a sequence of test tubes. Each test tube contains many strands of DNA that encode certain computations, and each subsequent test tube is created from earlier ones by some biological operation.

To explain the various operations to be performed on DNA, a simple notation is introduced. Given a string x over the alphabet {A, C, G, T}, ↑x represents the 5'-3' DNA strand made up of the letters of x. ↓x is the Watson-Crick complement of the strand ↑x, thus the 3'-5' strand. When ↓x and ↑x anneal to each other they form a double strand denoted by ↕x.


↑TGGACG represents the single stranded DNA molecule 5'-TGGACG-3'.

↓TGGACG represents the single stranded DNA molecule 3'-ACCTGC-5'.

↕TGGACG represents the double stranded DNA molecule formed by 5'-TGGACG-3' annealed to 3'-ACCTGC-5'.
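In software terms, the ↓ operation is just base-by-base complementation. A minimal sketch (the function name is assumed for illustration):

```python
# Watson-Crick complementation: A<->T, C<->G. down(x) gives the bases of
# ↓x aligned against ↑x (i.e. the 3'-5' strand read left to right).

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def down(x):
    """Return the complementary strand of x, base by base."""
    return "".join(COMPLEMENT[b] for b in x)

print(down("TGGACG"))  # ACCTGC, matching the 3'-ACCTGC-5' example above
```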


2.2.1 Extraction

Sometimes the ability is needed to extract (from a test tube) all strands containing any specific short nucleotide sequence. To accomplish this, the method of biotin-avidin affinity purification can be used, as described below:

When all strands containing the sequence ↑x have to be extracted, the first thing to do is create many copies of its complementary oligonucleotide (a short DNA strand), namely ↓x. In the next step, every oligo is attached to a biotin molecule, which is in turn anchored to an avidin bead matrix.

The double strands from the test tube have to be dissolved into single strands by heating the solution, a process referred to as melting. When poured over the matrix, those single strands that contain ↑x will anneal to the ↓x oligos anchored to the matrix. A simple procedure will wash away all strands that did not anneal, leaving behind only those strands that contain ↑x, which can then be retrieved from the matrix. They are converted to double strands by PCR (see below), if needed.
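A software analogue of this extraction step keeps only the strands that contain a given subsequence. This is a sketch of the selection logic, not a model of the actual chemistry:

```python
# Biotin-avidin extraction, abstracted: strands containing the probe
# sequence stay on the matrix, the rest are washed away.

def extract(tube, probe):
    """Split a tube into (strands containing probe, strands washed away)."""
    kept = [s for s in tube if probe in s]
    washed = [s for s in tube if probe not in s]
    return kept, washed

tube = ["AATTGG", "CCGGAA", "TTGGCC"]
kept, washed = extract(tube, "TTGG")
print(kept)    # ['AATTGG', 'TTGGCC']
print(washed)  # ['CCGGAA']
```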

2.2.2 Polymerization with DNA Polymerase

The enzyme DNA polymerase can be used to create the Watson-Crick complementary strand to any given single strand of DNA. DNA polymerase will "read" the given strand, called the template strand, in the 3'-5' direction and build the complementary strand in the 5'-3' direction, one nucleotide at a time.

DNA polymerase does, however, require a primer: a short portion of the template that is double stranded. The enzyme will add the new nucleotides onto the end of this short complementary piece.

For example, given the strand ↑xyz alone, DNA polymerase cannot create its complement. However, when ↓z is added to the solution, it will anneal to ↑xyz. Thus, ↑xy↕z is obtained, and DNA polymerase will be able to add nucleotides onto the free 3' end of the primer ↓z to create ↕xyz.

2.2.3 Amplification with the Polymerase Chain Reaction (PCR)

PCR is a process that uses DNA polymerase to make many copies of a DNA sequence. With a given duplex ↕xyz, the first step is to melt it to form ↑xyz and ↓xyz. To this solution, the primer oligos ↓z and ↑x have to be added, which will anneal to form the partial duplexes ↑xy↕z and ↕x↓yz.

DNA polymerase can then elongate the primers to create full duplexes of the form ↕xyz. The result of one PCR-cycle is two copies of the original strand. By repeating the process, the number of copies of the original strand will be doubled with every cycle, and the DNA template is exponentially amplified.
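The exponential amplification described above is easy to quantify; a sketch assuming perfect, error-free cycles:

```python
# Each PCR cycle doubles every template, so c cycles multiply the
# number of copies by 2**c.

def pcr_copies(templates, cycles):
    """Copies after `cycles` rounds of perfect (error-free) PCR."""
    return templates * 2 ** cycles

print(pcr_copies(1, 1))   # 2: one cycle turns one duplex into two
print(pcr_copies(1, 30))  # 1073741824: ~10^9 copies after 30 cycles
```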

2.2.4 Elongation of DNA

Sometimes strands in a test tube will need to be elongated by tacking another short strand onto the end. If every strand in the tube is of the form ↕Xy, where X is arbitrary and y is fixed, then this elongation can be accomplished as follows.

Firstly, an extract on ↑y is performed (see 2.2.1), leaving behind all the "top" strands of every pair; every ↑Xy is extracted and every ↓Xy is discarded.

In the next step, many copies of the single strand ↓yz are introduced into the solution, and allowed to anneal with the ↑Xy strands. This results in partial duplexes of the form ↑X ↕y ↓z.

DNA polymerase can subsequently be used to fill in the rest of the duplex, thus creating the full duplex strands ↕Xyz. The result is an elongation of every strand by the addition of another short strand, in this case ↕z.
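Abstracting away the chemistry, the elongation step appends a fixed suffix to every strand with the right ending; a sketch using the string encoding from section 2.2:

```python
# Elongation of section 2.2.4 in string terms: strands ending in y are
# extended by z; strands without the y ending are removed by the
# initial extraction step.

def elongate(tube, y, z):
    """Append z to every strand ending with y; drop the rest."""
    return [s + z for s in tube if s.endswith(y)]

tube = ["AACCGG", "TTGG", "AATT"]
print(elongate(tube, "GG", "TTTT"))  # ['AACCGGTTTT', 'TTGGTTTT']
```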

2.2.5 Separation of DNA strands with gel electrophoresis

Gel electrophoresis is a technique used for the separation of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or protein molecules using an electric field applied to a gel matrix.

DNA molecules are negatively charged. When placed under an electrical field, they will migrate through the gel towards the positive pole at rates dependent upon their sizes; a small DNA molecule can thread its way through the gel easily and hence migrates faster than a larger molecule.
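Computationally, gel electrophoresis acts as a sort-by-length; a minimal sketch:

```python
# Shorter strands migrate faster through the gel, so reading the gel
# back from the positive pole yields the strands shortest-first.

def electrophoresis(tube):
    """Return strands ordered shortest-first (fastest-migrating first)."""
    return sorted(tube, key=len)

tube = ["AAAATTTTGGGG", "AAAA", "CCCCGGGG"]
print(electrophoresis(tube))  # ['AAAA', 'CCCCGGGG', 'AAAATTTTGGGG']
```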

The next section will discuss how these biological operations can be used to solve computational problems.

2.3 The Travelling Salesperson problem (TSP): example of a DNA algorithm

Tackled below is a variant of the Hamiltonian path problem: given a graph consisting of vertices (cities) linked by binary paths (roads), find the shortest route (series of binary paths) between two given cities. This problem is called the Travelling Salesperson Problem (TSP) and requires that the binary paths be labelled with distances.

A DNA algorithm for solving the TSP is presented below.

Fig.3 A map consisting of four vertices (cities A, B, C and S) and the paths between them with labeled distances (n=1, 2 or 4).

To determine the shortest route between two cities, a series of steps has to be followed (adapted from Adleman's DNA algorithm):

Create a unique DNA sequence for each city (extraction, see 2.2.1): S = AAAA; A = CCCC; B = GGGG and C = TTTT (Fig.3). To keep it simple, clearly distinguishable DNA codes are used in this example. Keep in mind that the use of complementary codes (i.e. CCCC is complementary to GGGG) cannot be allowed in real DNA computation, since paths containing these complementary codes will be attracted to each other in test tubes, resulting in badly formed DNA strands.

Construct DNA representations of binary paths between two cities X and Y, where n is the distance between the two cities:

(a) For the binary path X→Y, join n occurrences of the DNA sequence for X with one occurrence of the DNA sequence for Y.

(b) For the binary path Y→X, join n occurrences of the DNA sequence for Y with one occurrence of the DNA sequence for X.

(c) Form a complementary strand where the join takes place to hold the binary path together.

For instance, the binary path S→B (Fig.3) has four occurrences of AAAA to represent the path length 4, followed by one occurrence of GGGG (Fig.4).

To form longer routes out of binary paths, splice two binary paths P1 and P2 together if and only if the final city code of P1 matches the first city code of P2. Delete half of the code of the final city in P1 and half of the code of the first city in P2 (which are matched).

For instance, the binary path S→B→A is formed by splicing the strands for S→B and B→A, where the last two bases of S→B and the first two bases of B→A are deleted because of the match.

To prevent loops, do not form longer routes if the final city in P2 matches the first city in P1.

Repeat the above two steps until no more routes can be formed (Fig.4).

Place all the DNA sequences produced so far into a test tube.

Extract all those routes which start with the DNA code of the desired start city and end with the DNA code of the desired destination city, and place them in a separate test tube. In this case, the strands for S→B→C and S→A→B→C are extracted.

Sort all the remaining routes by length (gel electrophoresis, see 2.2.5).

The shortest DNA sequence represents the shortest route between the desired start and destination cities. Also, the sequence of DNA codes in the shortest strand provides information as to the order in which cities are encountered, and the length of the route can be calculated as follows: (total length of shortest strand / number of bases for each city) minus 1.

In this case S→A→B→C, which is 24 bases long (as opposed to 28 for the other route), represents the shortest route. This strand contains within it the order in which the cities are to be visited (reading left to right), and the route length can be calculated as the total number of bases (24) divided by the length of each city code (4 bases), minus 1, resulting in route length 5.
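The whole encoding can be replayed as string operations. The city codes are those given in the text; the individual edge distances below are assumptions chosen to be consistent with the quoted strand lengths (28 bases for S→B→C, 24 for S→A→B→C) and with n = 1, 2 or 4 from Fig.3:

```python
# Toy replay of the DNA encoding of section 2.3 as string operations.

CODES = {"S": "AAAA", "A": "CCCC", "B": "GGGG", "C": "TTTT"}

def binary_path(x, y, n):
    """Path X->Y of distance n: n copies of X's code, then one copy of Y's."""
    return CODES[x] * n + CODES[y]

def splice(p1, p2):
    """Join two paths: delete the last 2 bases of p1 and first 2 of p2."""
    return p1[:-2] + p2[2:]

def route_length(strand):
    """(total bases / bases per city) - 1, as in the text."""
    return len(strand) // 4 - 1

# d(S,B) = 4 is given in the text; the other distances are assumed
route_SBC = splice(binary_path("S", "B", 4), binary_path("B", "C", 2))
route_SABC = splice(splice(binary_path("S", "A", 1),
                           binary_path("A", "B", 2)),
                    binary_path("B", "C", 2))

print(len(route_SBC), route_length(route_SBC))    # 28 6
print(len(route_SABC), route_length(route_SABC))  # 24 5
```

Sorting the two strands by length, as gel electrophoresis would, picks out S→A→B→C as the shortest route.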


3. Conclusion on DNA Computing

In 1994 Adleman [1] first proved that the tools of molecular biology can be used to solve instances of the famous Hamiltonian path problem. Recall that this problem is: given a set of "cities" and directed paths between them, find a directed tour that starts at a given city, ends at a given city, and visits every other city exactly once. This problem (HPP) is known to be NP-complete.

One of the major achievements of Computer Science in the last two decades is the growing evidence that no general efficient solution exists for any NP-complete problem. Thus, Adleman's result that HPP can be solved by a DNA based biological experiment is exciting.

However, it does not mean that all instances of NP-complete problems can now be solved feasibly. The way Adleman solved the HPP is pure brute force; he designed a biological system that "tries" all possible tours of the given cities. The difficulty is that even with 10^22 parallel computers one cannot try all tours for a problem with 100 cities. The brute force algorithm is simply too inefficient.
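The factorial growth behind this remark is easy to check; the figures below are generic, not Adleman's:

```python
# A tour fixes a start city and orders the remaining n-1 cities, so
# brute force must examine up to (n-1)! candidate tours.

import math

def candidate_tours(n):
    """Upper bound on tours a brute-force search must try for n cities."""
    return math.factorial(n - 1)

print(candidate_tours(10))             # 362880: trivial to enumerate
print(candidate_tours(100) > 10**150)  # True: far beyond 10**22 machines
```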

Biological computers can, however, solve any HPP with 70 edges or fewer. A practical issue is that there does not seem to be a great need to solve such HPPs; much larger instances can be routinely solved on conventional machines. In fact, all the techniques used so far can be simulated on classical parallel machines with a number of processors proportional to the number of strands.

The speed of any computer, biological or not, is determined by two factors: 1. how many parallel processes it has, 2. how many steps each can perform per unit of time.

The exciting point about biology is that the first of these factors can be very large: biological computations could potentially have vastly more parallelism than conventional ones, and this huge inherent parallelism has the potential to yield vast speedups over conventional electronic computers for computational problems.

The trade-off is that DNA computers require exponentially increasing volumes of DNA. Due to physical limitations, the number of strands is limited, and hence DNA can help to solve only instances of correspondingly modest size (the problem of scale).

For the TSP, the point is that the algorithm does not realistically scale up from the example described in section 2.3. When the space of possible routes grows exponentially, huge amounts of DNA are required to form the initial set of routes. Hartmanis (1995) calculated that adapting Adleman's DNA algorithm to the TSP would require a mass of DNA greater than that of the earth to solve a 200-city problem.

The second of these factors, however, is much more in favor of conventional electronic computers: a state of the art machine can easily execute 100 million instructions per second (MIPS). A biological machine, on the other hand, seems to be limited to a very small number of biological operations per second (BOPS).

Thus, the major problem currently for the DNA computing experiments described here is the time involved in these operations. While DNA processes within the test tube can take place millions of times per second (due to huge parallelism), biological operations (extraction, electrophoresis, …) can take several hours and even days, even for simple problems.

Example: for the TSP (see section 2.3), where the algorithm is basically an extension of Adleman's method, it has been conservatively estimated that computing a shortest route between a given start and end node, visiting each city just once in a fully interconnected 30-city network, would take several million years, even assuming a billion instructions executed per second.

Another issue involving DNA computation was pointed out by Adleman himself. The biological operations used in the experiments are presented as an idealized model which assumes that all the operations are error-free. However, nucleotides (bases) degrade over time during separation, amplification, … Thus, DNA computations become increasingly error-prone during biological operations, and the more strands there are, the greater the chance that the result of the computation is incorrect.

As a result, several researchers (e.g. Amos et al., 1997) have concluded that the complexity aspects of DNA algorithms will limit their applicability. This conclusion, however, might be a little premature. It has already been shown that biological experiments can be used to vastly speed up many important computations, but there is great need for future research.

Current research in DNA computing uses DNA as a data structure ('representational DNA'), but any algorithm which only assumes manual manipulation of data representations is unlikely to fare well in terms of time taken to produce a result. Thus, the issue is whether all the steps involved in algorithms for manipulating representational DNA can themselves be 'automated'.

Automated DNA can be achieved:

by encoding algorithmic processes (those normally achieved through human intervention) as executable DNA strands which, when transcribed into RNA and mapped onto enzymes within the test tube, can manipulate the representational DNA;

by introducing ready-made enzymes from outside to be combined in the same test tube as the representational DNA, so that operations currently executed manually outside the test tube take place within the test tube.

These two processes are roughly similar to those cellular processes in which a cell's DNA produces proteins and enzymes for use by itself (intracellular), and by other cells (extracellular), respectively.

Thus, it is proposed that new DNA algorithms, resembling in vivo processes in which the initial data pool need not contain every possible final answer, are needed.

Furthermore, because of the huge information storage capacity of DNA (1 mmol of DNA can encode 2 gigabytes of information), rapid and accurate data access is needed to take advantage of the massive parallelism. It is suggested that an automatic readout device is needed for this purpose.