Chapter 1

Introduction and Background

1.1 Related Work and Project Concept

In the late 1980s Nadrian C. Seeman at New York University founded the field of Structural DNA Nanotechnology [ACS01]. Since then many other labs have started exploring the possibilities of using DNA as a structural material. Various DNA machines have been constructed [NNT02]. Examples of these machines include the "tweezers" [NNT01], the "walker" [NAN01], the "rickettsia motif" [NNT03], the "lockbox" [NAT01], and the proposed "stack" [CHA01]. Some of the structures present in these machines have been replicated using the software resulting from this project.

Software packages that aim to design DNA structures include caDNAno [CAD01], NanoEngineer-1 [NEN01], NUPACK [NUP01], SARSE [SAR01], and UNAFold [NAR01]. Most of these packages are used for designing DNA origami structures, a concept introduced by Paul Rothemund [ROT01] [NAT02].

The software in this project was written based on Jessica P. Chang's topological modelling of DNA [CHA01]. This abstraction of DNA behavior lays the foundation for a technique to perform quick test tube simulations of mixing DNA strands. The second objective of the software is to sequence the construction of a designed DNA structure. When DNA strands are mixed in a test tube they combine and attach to each other in particular configurations. Depending on which DNA strands are mixed, in what quantities, and in which order, the resulting DNA structures will differ.

Starting with an empty test tube, referred to here as a vessel, different fuel strands can be added to produce vessels with different contents. In the process of developing a method for planning the construction of designed DNA structures, a vessel can be seen as a node in a directed graph, and the vessels that result from adding fuel strands can be seen as neighboring nodes, with edges directed from the original node to the neighboring nodes.


The idea behind the objective of constructing DNA structures was to use the A* algorithm to search this state space of vessels. A goal state could be defined as a vessel containing the designed DNA structure. The heuristic function needed for the A* algorithm would then be developed to estimate the number of additional fuel strands that would have to be added in order to produce a vessel containing the goal structure.

1.2 Introduction to DNA

DNA consists of four different bases: Adenine, Guanine, Cytosine, and Thymine, commonly referred to as A, G, C, and T [NHG01]. A base linked to a sugar is called a nucleoside. The backbone of the DNA strand is made from nucleosides joined together by phosphate. The phosphate groups form asymmetric bonds between the third and fifth carbon atoms of the adjacent sugar rings. This asymmetric bond means the strand has a direction, usually described as going from the 5' (five prime) end to the 3' (three prime) end. Such a sequence of alternating phosphate and sugar units with attached bases is referred to as single-stranded DNA, or simplex DNA.

The DNA double helix is formed when two strands of simplex DNA are connected anti-parallel, i.e. they run alongside each other in opposite directions. It is the bases that provide the connection between strands. This connection is shown in Figure 1.1. For the connection to be successful, each base on the first strand has to match the corresponding base on the second strand. A connection can only take place between A and T, and between C and G. For example, a simplex strand of sequence GATTACA (read from 3' to 5') can only connect, or anneal, to the inverse sequence TGTAATC (read from 3' to 5').

The phenomenon of branch migration occurs when two identical segments of single-stranded DNA compete over annealing to an inverse segment of simplex DNA. An example of this is the duplex strand migration taking place in a Holliday junction, shown in Figure 1.2, where two pairs of strands with inverse segments of simplex DNA are involved [GEN01]. Branch migration involving only one segment and two inverse segments is called simplex branch migration.

1.3 DNA Abstraction

The abstraction of simplex DNA proposed by Jessica P. Chang [CHA01] provides the concept on which this project is based. It proposes a complete language for describing the intrinsic topology of DNA complexes, along with a description of DNA complex transitions in terms of this representation.

A description of a portion of the proposed ASCII notation syntax follows. The sequence of bases in a strand of simplex DNA is grouped into segments. A segment is represented by a single letter, e.g. [a]. The complement segment of [a] is denoted by the opposite letter casing, [A]. A simplex DNA strand can be described as one or more segments, e.g. [a b c]. The segments are said to be ligated. [a b c] implies the direction from the 3' end to the 5' end, i.e. the 5' end of [a] is ligated to the 3' end of [b], and the 5' end of [b] is ligated to the 3' end of [c]. If a simplex DNA strand consists of a repeated sequence of bases, only one label is needed to describe this sequence, but an additional index is needed to address which occurrence is referred to; e.g. a strand consisting of two repeated base sequences could be described by [a0 a1]. Note that there is nothing chemically distinguishing one ligated segment from the next. A segment is here simply a reference to a set of DNA bases in a sequence.

The annealing of two inverse segments is indicated by writing the label of the annealed-to segment directly after the segment's own label; e.g. two annealed inverse segments on the same strand can be described by [aA Aa]. If the two annealed segments are not ligated, the beginning of a new strand is indicated by [|], e.g. [aA | Aa]. The term closed component, or here just component, is borrowed from topology and is used to refer to a set of simplex DNA strands connected to each other in some configuration. Figure 1.3 shows a graphical representation of the example component [a0A0 a1 A0a0 b0B0 | B0b0 c0].

The second part of the abstraction describes a limited number of transitions that components can undergo. Two unannealed inverse segments can anneal to each other, and two annealed segments can separate. An anneal between two segments on two different components is referred to as an inter-component-anneal, and an anneal between two segments on the same component is referred to as an intra-component-anneal. An example of an inter-component-anneal transition is component [a0] and component [A0] transitioning into [a0A0 | A0a0]. An example of an intra-component-anneal transition is [a0 A0] transitioning into [a0A0 A0a0].
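As a rough illustration of how this abstraction can be represented programmatically, the following Python sketch (not the project's actual classes; the attribute names are invented here) models a segment and the inter-component-anneal example above, where [a0] and [A0] transition into [a0A0 | A0a0].

class Segment:
    def __init__(self, label):
        self.label = label      # e.g. "a0"; the opposite letter casing denotes the inverse
        self.annealed = None    # the segment this one is annealed to, if any

    def is_inverse_of(self, other):
        # "a0" and "A0" are inverses: same index, opposite casing.
        return self.label == other.label.swapcase()

    def anneal(self, other):
        # Two unannealed inverse segments can anneal to each other.
        assert self.annealed is None and other.annealed is None
        assert self.is_inverse_of(other)
        self.annealed = other
        other.annealed = self


# Inter-component anneal: component [a0] and component [A0]
# transition into [a0A0 | A0a0].
a0, A0 = Segment("a0"), Segment("A0")
a0.anneal(A0)
print(a0.annealed.label)  # -> A0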

The connection between two ligated segments is broken when they cleave, and a segment with no other segment ligated at the 3' end can ligate to a segment with no other segment ligated at the 5' end. This implies breaking and creating the covalent bond along the DNA backbone. These transitions, along with the separate transition, do not happen spontaneously but are performed only by proteins. An example of a ligate transition is component [a] and component [b] transitioning into [a b].

Simplex branch migration taking place on a component is referred to as a displace transition, and duplex branch migration is referred to as an exchange transition. An example of a displace transition is shown in Figure 1.4, where [a0A0 A0a0 a1] transitions into [a0 A0a1 a1A0]. An example of an exchange transition is shown in Figure 1.5, where component [a0A1 A0a1 | a1A0 A1a0] transitions into two components of [a0A0 A0a0].

1.4 A*

A* is a heuristics-based graph search algorithm for finding the shortest path between two nodes [IEE01]. It is commonly used when there exists a method of estimating the distance between two nodes, and when the state space is very large. Traditional shortest path algorithms, such as Dijkstra's algorithm [NUM01], are impractical for problems with too large a state space. The A* algorithm takes advantage of the available heuristic estimate of the distance between nodes to speed up the search. Breadth-first search, depth-first search, and Dijkstra's algorithm can be seen as special instances of A*.

For every node, A* needs to know its available neighboring nodes, the cost traveled from the initial node, and the estimated remaining cost to the goal node. A closed set and an open set are used to keep track of visited nodes and found nodes. That is, visited nodes in the closed set have been expanded, and their neighboring nodes have been added to the open set.

Cost traveled from the initial node, and estimated remaining cost to the goal node, are by convention referred to as follows.

g(x) = cost of travel from the start node to node x
h(x) = estimated remaining cost from node x to the goal node
f(x) = g(x) + h(x)

For each A* iteration an unvisited node is selected and removed from the open set, its neighbors are added to the open set, and finally it is added to the closed set. A node is selected from the open set by choosing the node with the lowest value of f(x). Iteration stops when a goal node with an f(x) value lower than the f(x) value of all nodes in the open set has been discovered.

An estimate of the remaining cost of travel to the goal node that never overestimates the actual remaining cost is said to be an admissible heuristic. With an admissible heuristic A* is guaranteed to return the optimal path: the termination criterion states that the returned goal node has an f(x) lower than that of all nodes in the open set, and if the heuristic never overestimates, none of the remaining nodes in the open set can lie on a path with lower cost than the goal path already found.
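A minimal, generic Python sketch of the loop described above is shown below. The callables is_goal(), neighbors(), cost() and h() are placeholders for the problem-specific parts; this is not the implementation used in the project.

import heapq
from itertools import count

def a_star(start, is_goal, neighbors, cost, h):
    # neighbors(n) yields adjacent nodes, cost(a, b) is the edge cost,
    # h(n) estimates the remaining cost to a goal (the heuristic).
    tie = count()                              # tie-breaker so nodes never get compared
    open_set = [(h(start), next(tie), start)]  # priority queue ordered by f(x)
    g = {start: 0}                             # best known cost of travel from the start
    came_from = {}
    closed = set()
    while open_set:
        f, _, node = heapq.heappop(open_set)
        if node in closed:
            continue
        if is_goal(node):                      # this goal's f is <= f of all open nodes
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return list(reversed(path))
        closed.add(node)
        for nb in neighbors(node):
            tentative = g[node] + cost(node, nb)
            if nb not in g or tentative < g[nb]:
                g[nb] = tentative
                came_from[nb] = node
                heapq.heappush(open_set, (tentative + h(nb), next(tie), nb))
    return None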

Chapter 2

Design and Implementation

Implementation of the project was done in Python. Python was chosen due to previous experience with the language, its clear syntax, and because performance was initially not believed to be an issue.

Independent of the A* mechanics is the DNA-related code. In accordance with the Acyclic Dependencies Principle [OBJ01] the DNA mechanics is implemented as a unidirectional dependency hierarchy. A vessel is composed of components with associated concentrations, and a component is composed of segments connected in a particular configuration. The DNA A* instance inherits from the A* abstract base class and uses vessels to define A* nodes.

Also part of the DNA package is a graphics module for drawing the DNA structures, and a utilities module for various useful functionality, including some debugging tools.

2.1 Segment

A segment contains information about a DNA segment and holds the corresponding label. It also contains information on which other segments (if any) it is ligated to at the 3' end and the 5' end, and which segment it is annealed to. A segment performs the basic transitions: anneal, separate, displace, exchange, ligate, and cleave, and in addition holds a couple of further functionalities that are worth mentioning.

2.1.1 Node Iteration

In the task of simulating the behavior of single-stranded DNA it is important to keep careful track of the connectivity between segments. A component node is defined by all segments with one (or both) end(s) at that node. Two ligated segments are said to be ligated at the node at which they are connected. The 3' end of one segment and the 5' end of its annealed segment are said to belong to the same node. One node can thus theoretically connect an infinite number of segments, in a "pom-pom" structure like the one shown in Figure 2.2.

The desired information about a node, specified by a segment and the segment end, is all segments that share this node, and their directions. An annealed segment is said to be anti-parallel to the specified segment. A segment ligated to the specified segment, at the specified segment end, is also said to be anti-parallel (from the perspective of the reference node they run in different directions).

In the example component depicted in Figure 2.2 all segments with a lower case label are parallel to each other with respect to the middle node, as they are all directed toward that node. All segments with upper case labels are thus also all parallel to each other, but anti-parallel to the segments with lower case labels.

The algorithm for node iteration works by first returning the starting segment and declaring it parallel. If the 3' end is the specified segment end for node iteration, the next segment returned is the segment ligated at the 3' end, and it is declared anti-parallel. The next segment returned is the ligated segment's annealed segment, which is again parallel. This process continues until either iteration returns to the starting segment or there are no more connected segments. The same procedure is then performed in the opposite direction: the starting segment's annealed segment is returned and declared anti-parallel, the segment ligated at the annealed segment's 5' end is then returned, and so on.

With the example of the "pom-pom" structure in Figure 2.2, if for example the node is specified by the 3' end of segment A0, the iterator will return A0:parallel, e0:anti-parallel, E0:parallel, d0:anti-parallel, and so on. If instead the node is specified by the 5' end of A0, the iterator will first return A0:parallel, then continue in the reverse direction since no segment is ligated at the 5' end of A0, and return a0:anti-parallel. Iteration then stops since there is no segment ligated to a0 at the 3' end.
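The following Python generator is a sketch of the node iteration just described. The Segment attributes annealed, ligated_3 and ligated_5 are hypothetical names assumed for illustration; the project's actual code may differ.

def node_segments(segment, end):
    # Yield (segment, orientation) for every segment sharing the node given by
    # `segment` and one of its ends ("3" or "5").
    other = "5" if end == "3" else "3"

    def ligated(seg, at_end):
        return seg.ligated_3 if at_end == "3" else seg.ligated_5

    yield segment, "parallel"

    # Forward: the segment ligated at the specified end (anti-parallel), its
    # annealed segment (parallel), and so on, alternating.
    current, parallel = ligated(segment, end), False
    while current is not None and current is not segment:
        yield current, "parallel" if parallel else "anti-parallel"
        current = ligated(current, end) if parallel else current.annealed
        parallel = not parallel
    if current is segment:
        return  # iteration went all the way around the node

    # Reverse: the starting segment's annealed segment (anti-parallel), the
    # segment ligated at its opposite end (parallel), and so on.
    current, parallel = segment.annealed, False
    while current is not None and current is not segment:
        yield current, "parallel" if parallel else "anti-parallel"
        current = current.annealed if parallel else ligated(current, other)
        parallel = not parallel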

An interesting situation is when there is a segment with both ends at the same node. This occurs, for example, when three segments are ligated together and the first and the last segment are annealed. The looped segment is then returned both as being parallel and as being anti-parallel to the starting segment.

In the example with the component shown in Figure 2.3, all segments will be returned as being both parallel and anti-parallel, at both the 3' and 5' ends, of both b0 and B0. For a node iterator defined by the 5' end of a0, the parallel segments returned will be b0, B0, C0, and c0, and the anti-parallel segments returned will be b0, B0, A0, and c1.

2.1.2 Segment Comparison

As shall be seen later, being able to compare segments is important for creating a unique representation of a component. Two segments only compare equal if they have the same label, the same neighbors, the same neighbors' neighbors, and so on. If two segments compare equal, it implies that they are surrounded by identical connectivity, i.e. they are identical segments, in identical positions, on identical components. Two different segments on the same component can still compare equal, however, if the component is symmetrical. The segments a0 and a1, and segments A0 and A1, in the component shown in Figure 2.4 are good examples of segments that belong to the same component and compare equal.

The problem of comparing two segments is solved by using a breadth-first component iterator and comparing each visited segment locally. A breadth-first component iterator of a segment iterates over all connected segments, starting with the closest neighbors, then the neighbors' neighbors, and so on. The segments returned by the iterators of the two segments being compared are then compared on a local level. Local comparison is performed by comparing the two segments' labels, the existence of an annealed segment, the annealed segment's label, the existence of a ligated segment at the 3' end, the ligated segment's label, and an equivalent comparison at the 5' end.

Iteration continues on both iterators until either two segments compare different on a local level, or one iterator runs out of elements and the other does not; in either case the two original segments compare different on a global scale. If all pairs of segments returned from the two iterators compare equal locally, and both iterators run out of elements on the same iteration, the segments compare equal.
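A sketch of this lockstep comparison is given below, assuming a hypothetical breadth_first() iterator on segments and the same invented attribute names as before; the real implementation may differ.

from itertools import zip_longest

def locally_equal(s, t):
    # Local comparison: label, the annealed segment's label, and the labels of
    # the segments ligated at the 3' and 5' ends (None where absent).
    def profile(seg):
        return (seg.label,
                seg.annealed.label if seg.annealed else None,
                seg.ligated_3.label if seg.ligated_3 else None,
                seg.ligated_5.label if seg.ligated_5 else None)
    return profile(s) == profile(t)

def segments_compare_equal(a, b):
    # Walk both components breadth-first in lockstep; any locally unequal pair,
    # or one iterator running out before the other, means the segments differ.
    for s, t in zip_longest(a.breadth_first(), b.breadth_first()):
        if s is None or t is None or not locally_equal(s, t):
            return False
    return True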

The two segments with label A0 in Figure 2.5 need four iteration steps to determine that they compare different. Using the legend [label : annealed segment label : 5' end segment label : 3' end segment label], the first two segments, [A0:a0:C0:None], returned by the two component iterators compare equal. The following two pairs of segments, [a0:A0:b0:None] and [C0:None:None:A0], also compare equal. On the fourth iteration, the b0 segments on the two components compare different locally, with [b0:B0:None:a0] and [b0:None:None:a0] not being identical.

2.2 Component

A component is a collection of segments that are all connected. When a separating transition such as separate or displace is performed on segments of one component, the component invariant of holding only segments that are connected to each other must be maintained. In the case of a separating transition, when one component splits into two, new component instances are formed.

The problem of finding a unique representation of a component is nontrivial, and it is the process that consumes the majority of execution time in the end product. The complexity of the problem might be better understood when considering the definition of uniqueness: only identical components may hold identical representations, and identical components must hold identical representations.

2.2.1 Unique Representation

In order to achieve a unique representation of a component, the associated segments must be listed in a unique order, and information about their connectivity must be included. If the segments are not ordered uniquely, then identical components might list their segments in different orders. If the connectivity of the segments is not included, then different components with segments of identical labels but with differing connectivity might return identical representations.

A unique ordering of segments can be achieved by sorting all segments, selecting the first segment in the sorted list, and using that segment's method for iterating over all its connected segments. That is, in order to get a unique order of segments, the difficulty lies in selecting one unique segment. This is also the hardest part of comparing two components. Once one segment from each of the two components being compared has been uniquely selected, testing for equality between the components is the same as testing for equality between the uniquely selected segments.

The problem with sorting a list of segments is that it can be costly, as comparing two segments can be costly if they are very similar. This was previously discussed in Section 2.1.2. Especially if the component is large, and there are many similar segments in the list to be sorted, this can be a major problem. A screening process was therefore developed to reduce the size of this list.

In the segment screening process, subsets for different segment properties are formed, and segments with the appropriate properties are added to these sets. For example, there is a subset for all segments with an annealed segment, and there is a subset for all segments with a particular label. An annealed segment with label a, ligated at the 3' end, would then be added to the annealed subset, the label-"a" subset, the ligated-at-3'-end subset, and the not-ligated-at-5'-end subset. These subsets can be combined further by creating subsets of two or more original subsets, taking their intersection. Examples of such derived subsets are the subset of annealed but not-ligated-at-3'-end segments, and the subset of unannealed segments with label b. The smallest subset is then selected, and sorting is performed only on this smaller set of segments.
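The screening step could look roughly as follows; the particular properties used as keys and the tie-breaking rule are illustrative assumptions, not the project's exact choices.

from itertools import combinations

def smallest_property_subset(segments, max_combine=2):
    # Group segments into subsets by simple properties, derive further subsets
    # by intersecting up to max_combine of them, and return the smallest
    # non-empty subset, so that sorting only has to consider those segments.
    subsets = {}
    for seg in segments:
        subsets.setdefault(("label", seg.label), set()).add(seg)
        subsets.setdefault(("annealed", seg.annealed is not None), set()).add(seg)
        subsets.setdefault(("ligated-3", seg.ligated_3 is not None), set()).add(seg)
        subsets.setdefault(("ligated-5", seg.ligated_5 is not None), set()).add(seg)

    candidates = dict(subsets)
    for r in range(2, max_combine + 1):
        for keys in combinations(subsets, r):                  # grows as C(n, r)
            merged = set.intersection(*(subsets[k] for k in keys))
            if merged:
                candidates[keys] = merged

    # Ties between equally small subsets are broken by a fixed precedence rule;
    # here simply the string form of the key.
    best = min(candidates, key=lambda k: (len(candidates[k]), str(k)))
    return candidates[best]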

If only one segment property at a time is considered when examining the component shown in Figure 2.6, every property is shared by at least seven segments. There are, for example, seven segments with label a, eight unannealed segments, and eight segments with no ligated segment at the 3' end. When looking at several properties at the same time, however, only segments A3, A4, and A10 have label A, are annealed, and are ligated at both the 3' end and the 5' end. This means that sorting only needs to be performed on these three segments to select a unique segment for the component.

If there are multiple subsets of the smallest subset size, then the subset is chosen by a precedence rule. Also worth noting is that as the number of subsets allowed in a combination increases, the number of resulting subsets from intersections increases exponentially, according to Equation (2.1). For example, if only combinations of 2 subsets from 10 available are made, the number of resulting combinations is 45, plus the original 10. If all combinations of 3 from 10 are also made, there will be an additional 120 combinations (assuming there are no empty intersections).

C(n, k) = n! / (k! (n - k)!)    (2.1)
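For reference, the example numbers above can be checked with Python's math.comb:

from math import comb

print(comb(10, 2))  # 45 derived subsets from combining 2 of 10 original subsets
print(comb(10, 3))  # a further 120 from combining 3 of 10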

To describe the connectivity of the segments, they are all given an index so that two segments with identical labels can be distinguished from each other. A segment index is easily obtained by looking up the index of the segment in the uniquely ordered list of segments. A unique component representation is then achieved by iterating over all segments in a unique order, and for every segment describing its connectivity to its closest neighbors, i.e. any annealed segment, and any segment ligated at the 3' end or the 5' end.

2.2.2 Perform Node Initiated Transitions

Some components can be described as consisting of configurations that are not likely to remain for very long. In order to create a realistic simulation of the behavior of DNA, a method to perform all very likely, spontaneous transitions in a component is needed. For example, two ligated segments that are each other's inverses are assumed to instantaneously fold up and anneal to each other.

There are three different so-called node-initiated-transitions: the displace, the exchange, and the node-initiated-anneal (just described). What they all have in common is that the point where the transition starts is located at a node. Since the bases on the DNA strand are located right next to each other, they are assumed to instantaneously create a link. In the case of the node-initiated-anneal, when two unannealed segments are ligated to each other and are each other's inverses, it is assumed that they instantaneously anneal to each other. The displace and the exchange transitions can be described similarly.

When a potential displace transition is discovered it is assumed to take place instantaneously. There are two things that can happen next. Either the displaced segment has no secondary connection to the component and separates from it (Figure 2.7), or the displace goes in reverse and transitions back into the original structure. In the latter case there is a toggle situation, where the exact state of the component is undecided and can be represented by either configuration. This is shown in Figure 1.4 on page 13.

The situation for the exchange transition is analogous to that of the displace transition. Figure 1.5 on page 13 shows a separating exchange transition. If a potential exchange transition is available, it is assumed to happen instantaneously. If the involved segments lose connectivity they are separated; otherwise there is a toggle situation.

A performed node-initiated-transition can also lead to other new potential node-initiated-transitions. This is the case, for example, with four ligated segments, shown in Figure 2.8. The first and last segments are each other's inverses, and so are the two middle segments. The first node-initiated-anneal takes place between the two middle segments, b0 and B0. After this transition, the first and last segments, a0 and A0, are in position for a potential node-initiated-anneal, as the previous anneal transition brought the 5' end of a0 and the 3' end of A0 together.

Displace and exchange transitions can also lead to new potential displace and exchange configurations. This can potentially be hazardous for a simulation when many toggles are available on a component. For n disjoint toggles on a component there are at least n² different configurations of the same component. This is illustrated in Figure 2.9, where one component toggles between five different states.

2.3 Vessel

A vessel is a collection of components with associated concentrations. The main job of a vessel is to stabilize its contents and to keep track of the concentrations of all the different components present. A typical vessel that is not considered stable may contain components with potential node-initiated-transitions, components with potential intra-component-anneals, and components that can perform inter-component-anneals by annealing to each other.

2.3.1 Creation of New Components

If there is a component present with n unannealed segments of a particular label, and another component with m unannealed segments with the inverse label, there are n·m potential inter-component-anneals that can take place. In the example with the components shown in Figure 2.10, there are six different ways they can anneal: either of the segments a0 and a1 can anneal to any of the segments A0, A1, and A2.

After one such iteration of the vessel's stabilize method, the content of the vessel can still not be considered stable. There are many more components that can be formed, potentially even an infinite number. After one iteration of components annealing to each other, the newly formed components can further anneal to other newly formed components, or to components left unannealed from the previous iteration. In the worst case, new components can be formed at every iteration. It is therefore very important to keep track of the concentration of each component in the vessel.

In the example shown in Figure 2.11, a vessel containing the components [a0 b0] and [A0 B0] has started to stabilize and form many new components. It can be seen that the longest present component can potentially anneal to another copy of itself, implying that the length of the longest component in the vessel doubles after every iteration of stabilize.

2.3.2 Minimum Component Concentration

To prevent a never-ending iteration of the vessel's stabilize method, it is necessary to introduce a limit on execution time and to ignore the components of lowest concentration. The minimum-component-concentration value is not only used to ignore components with concentrations below this threshold, but also to determine when a vessel is considered stable. When the change in concentration of every component between two iterations of the stabilize method is lower than the minimum allowed concentration value, iteration stops and the vessel is declared stable.
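A sketch of such a stabilize loop is given below; the Vessel methods step(), remove_below() and concentrations() are hypothetical names used only for this illustration.

import time

def stabilize(vessel, min_concentration=1e-4, max_seconds=30.0):
    # Iterate until every component's change in concentration between two
    # iterations is below the minimum allowed value, or the time limit is hit.
    deadline = time.monotonic() + max_seconds
    previous = vessel.concentrations()          # dict: component -> quantity
    while time.monotonic() < deadline:
        vessel.step()                           # one round of transitions
        vessel.remove_below(min_concentration)  # ignore very dilute components
        current = vessel.concentrations()
        change = max((abs(current.get(c, 0.0) - previous.get(c, 0.0))
                      for c in set(previous) | set(current)), default=0.0)
        if change < min_concentration:
            return True                         # vessel declared stable
        previous = current
    return False                                # time limit reached first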

2.3.3 Null Intra and Inter Transitions

For every step of the stabilizing iteration process, there are three different things that can happen to an individual component in a vessel. It can remain unchanged; it can transition into a different component with no external cause (e.g. two unannealed inverse segments on the component anneal to each other, or the component performs a displace transition); or it can transition into a different component by annealing to another component (i.e. an unannealed segment on the component anneals to an unannealed inverse segment on another component). It is an impossible task to decide exactly what proportion of a component remains unchanged (performs a null-transition), what proportion transitions without any external impact (performs an intra-component-transition), and what proportion transitions by annealing to another component (an inter-component-transition). This data can only be obtained by lab experiments. Some of the factors involved are vessel dilution, temperature, and pH. Figure 2.12 illustrates the possible transitions that a component can undergo.

The reason the null-transitions are important is to allow components formed at a later iteration to interact with components that have been left unchanged from an earlier iteration. If the entire quantity of a component transitions into another component at an iteration of stabilize, then many potential transitions that take place in actual lab experiments are left out.

The assumption is made that a component is more likely to perform a potential intra-component-transition than an inter-component-transition. This has been confirmed by lab experiments, and is due to the random movement of components. As components float freely in a test tube, they twist and turn. A segment on one component is more likely to collide with a segment on the same component than with a segment on a different component. This is why the concentration of the contents in a vessel is important in determining these proportions. The more dilute the contents of a vessel, the greater the distance between components. Components in a more dilute solution are therefore less likely to anneal to each other, and intra-component-anneals will be more likely.

2.3.4 The Transition-Weight-Problem

For inter-component-transitions, determining what quantity of a component will anneal to another is not trivial. In the example of the three components shown in Figure 2.13, two different anneals can take place: the "A" component can anneal to either the "B" component or the "C" component. If the concentration of the "B" and "C" components is the same, it is fair to assume that each anneal is equally likely. If, for example, the "B" component has a higher concentration than the "C" component, it is also fair to assume that an anneal between components "A" and "B" is more likely than an anneal between components "A" and "C".

The situation easily gets more complicated when more and longer components are involved, all with different concentrations. How is the weight of each potential anneal transition determined when, for example, another component with an unannealed segment of label d is added, and all components have different quantities? One property of the resulting transition-weights that must hold, however, is that a component should be more likely to anneal to another component of higher concentration than to one of lower concentration. Without this property, it might be that half the quantity of the "A" component anneals to the "B" component, and the other half to the "C" component, even though the concentration of the "C" component might be a thousand times higher than the concentration of the other components.

A first approach to the transition-weight-problem is to try to solve it with a system of equations. The problem with this approach, however, is that the number of unknowns in the system quickly becomes very large when there are many components. Given that these calculations are only estimations of a real lab experiment, it is unnecessary to waste precious execution time on complex equation solving.

The preferred quick and easy solution to the problem of deciding what quantity of one component anneals to another is to multiply their quantities and divide by the total quantity of components in the vessel. Equation (2.2) shows this calculation. This method guarantees that the sum of all reserved quantities of a component that will anneal to other components never exceeds the total quantity of that component. It also guarantees that a component is more likely to anneal to a component of higher concentration than to a component of lower concentration.

q_ij = (q_i q_j) / (sum_k q_k)    (2.2)

sum_j q_ij = sum_j (q_i q_j) / (sum_k q_k) = q_i (sum_j q_j) / (sum_k q_k) = q_i    (2.3)

The problem with this method, however, is that in most situations there will be unused quantity of a component that has not been reserved for annealing to another component. In the common scenario, not all components in a vessel can anneal to all other components. Equation (2.3) shows that if the quantities of all anneal-transitions of component i to all other components are summed up, this value equals the total quantity of component i. This is to be expected if component i can in fact anneal to all other components. If two components i and j cannot anneal, then the total quantity of component i will be higher than the total quantity of the transitions that component i is involved in. The resulting effect is that more null-transitions are created.
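Equation (2.2) is simple to express in code; the quantities in the example below are made up and only illustrate that more of a component is reserved for the neighbor with the higher concentration.

def anneal_quantity(q_i, q_j, all_quantities):
    # Equation (2.2): the quantity of component i reserved for annealing to
    # component j is their product divided by the total quantity in the vessel.
    return q_i * q_j / sum(all_quantities)

# Made-up quantities: more of the first component is reserved for the third
# component (quantity 3.0) than for the second (quantity 2.0), as required.
quantities = [1.0, 2.0, 3.0]
print(anneal_quantity(quantities[0], quantities[1], quantities))  # 0.333...
print(anneal_quantity(quantities[0], quantities[2], quantities))  # 0.5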

2.4 A* Abstract Base Class

The A* mechanics is implemented such that it is independent of the DNA mechanics. A few modifications were needed to adapt the classical A* method to our problem. The goal of the implementation was to still be able to run A* in the classical sense, while also allowing the user to run an instance with these modifications by specifying the appropriate parameters.

The basic idea behind using A* for DNA simulation to build particular structures was to translate the workings of experiments in a DNA lab into the workings of A*. The state of a test tube could be translated into an A* state. The mixing of test tubes, or the addition of a fuel component or fuel strand, could be translated into the A* get_neighbors() function. Finally, a test tube containing the desired goal component could be translated into an A* goal state.

2.4.1 A* Modifications

An A* adaptation to our problem is necessary for two reasons. The first reason is an unclear state definition. Many different ways of defining the state of a DNA instance of A* have been considered. Given that the goal of the project was to create a particular component, at first it seemed natural to define an A* state to simply be a component. It was soon realized that if the resulting goal path, from a start component to a goal component, was going to be of any practical use, the entire content of the test tube would have to be considered, as other components in the test tube could potentially affect the one component defining the A* state.

The idea of a simulated vessel that stabilizes its contents came out of the realization that the entire content would have to be considered. It is still not evident, however, that an A* state is best described by the unique representation of a vessel. A unique representation of a vessel would include all components and their exact quantities. This means that two vessels with only a small difference in concentration for a particular component would be defined as different states.

The second reason for the modifications of the A* algorithm is a non-admissible heuristic (see Section 1.4). The result of having a heuristic that commonly overestimates is that early termination of iteration in the algorithm is likely, and there is a risk of a non-optimal path being returned.

The result of having a potentially infinite state space, where it is rare to encounter the same state twice, is a need to be flexible with the state definition for the A* open set versus the closed set. The reason a smaller state space is desirable is that there are fewer nodes to visit when searching.

2.4.2 Closed Set State Representation

The closed set contains states that have already been expanded, or visited, and whose neighbors, i.e. the resulting test tubes after adding a fuel strand, have been added to the open set. When a new state is found, call it state A, and the cost of traveling to this state is lower than the cost of traveling to an already visited state, call it state B, then if these states compare equal, state A will replace B. Since the neighbors of visited state B have already been added to the open set, and B compares equal to A, the neighbors of state A are assumed to be the same as the neighbors of B. The neighboring states of A are therefore not expanded, and are consequently not added to the open set. When the A* algorithm finishes execution and returns a path to a goal state that includes state A/B, for the path to be of any value, it is essential that the path from state A/B to the goal state is valid. The reason for this concern is that the path from state A/B to the goal state is NOT valid if states A and B do not share the same neighbors, and neighbors' neighbors, and so on, all the way to the goal state.

2.4.3 Open Set State Representation

The need for states that compare equal to have identical neighborhoods holds for the closed set, but the same is not true for the open set. In order to reduce the size of the state space, it is an attractive idea not to add a newly found state as a new entry in the open set if the open set already holds another state that is similar to it, rather than only if the open set holds a state that is identical. That is to say, if a newly found state is similar to a state in the open set and the cost of travel to the new state is lower, the old state should be replaced; if the cost of travel to the new state is higher, the new state should be dropped. To implement this idea, the notion of state equality needs to be revised. A state needs to have two different methods for measuring equality, one for the closed set and one for the open set.

2.4.4 Overestimation and Underestimation

When using a non-admissible heuristic that commonly overestimates with the classical A* implementation, the algorithm might return a goal path that is not optimal. If a heuristic that frequently overestimates is pushed down, for example by multiplying it with some constant, there might instead be a problem of late termination.

The termination criterion of classical A* is to terminate once the f() value of the best found goal state is lower than the f() values of all states in the open set. With an admissible heuristic, this guarantees that the returned goal state will be optimal. If the heuristic commonly overestimates, then when iteration stops because the f() value of a found goal state is lower than that of all states in the open set, one of the states in the open set could still lie on the path to a goal state with an even lower f() value. Because of this, it might be useful to allow continued exploration of the state space.

In Figure 2.14 the progression of A* with an overestimating heuristic is illustrated. In this instance of A* every node has three neighbors. One neighbor leads one step closer to a goal node, one neighbor leads one step away from a goal node, and the last neighbor does not take a step in either direction, with no change in the actual remaining cost of travel. The starting node is the node added at the 0th iteration, with an actual remaining cost of travel to the goal node of 2.

In the example, the heuristic function overestimates by approximately ten times the actual remaining cost of travel. When the first node expands its neighbors, the heuristic function accidentally estimates that the remaining cost of travel from the closer neighbor is higher than the remaining cost of travel from the neighbor with no actual change in remaining distance. This is something that could realistically happen, as heuristic functions simply provide estimates. The consequence is that the depth-first-like search that results from the overestimating heuristic branches off from the optimum path to a goal, and a non-optimal goal is returned.

Conversely, if the heuristic commonly underestimates, then with a large state space, nodes with many neighbors, and long execution times for each iteration, early termination of the algorithm might be of interest. The more a heuristic underestimates the remaining cost of travel to a goal state, the more states close to the starting node will be prioritized, and the A* algorithm will perform a search more similar to a breadth-first search.

For the A* instance with DNA, the concept of an optimal solution is somewhat vague. Cost can be measured in the quantity of added fuel strand or the actual cost of making the fuel strands, but the main focus is to minimize the number of steps. Finding a goal path at all is much more important than finding what might be the optimal goal path, and if execution time can be reduced, this might be preferred over finding a more optimal goal path.

The scenario shown in Figure 2.15 is the same as that in Figure 2.14, but with an underestimating heuristic instead of an overestimating one. The resulting breadth-first-like search can be seen, as the nodes closest to the start node always have priority due to the severe underestimation.

2.4.5 Priority Queue

The data structure of choice for the open set is a priority queue. To deal with newly found states that compare equal to, and replace, an already existing state in the open set, there is a need for an update, or replace, function. Such a function is not always included in a standard priority queue definition. It requires immediate access to an item with a particular value. With the use of a hash table this can be done in O(1) time.

The quick access time is particularly important for the event when a state in the closed set is updated. An improvement in the path to the updated visited node in the closed set means that all outgoing neighbors of this node need to be updated as well. If some of these neighbors have already been visited, their outgoing neighbors have been expanded too, and they also need to be updated. This leads to a potentially large number of items in the priority queue that need to be updated, and an access time of O(n) would in this case not be acceptable.
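One way to obtain such an updatable priority queue in Python is to combine heapq with a dictionary of live entries, as in the sketch below. This is a standard pattern, not necessarily the project's implementation.

import heapq
from itertools import count

class UpdatablePriorityQueue:
    # A priority queue with an update/replace operation: a hash table maps each
    # key to its live heap entry, so membership tests and updates need only O(1)
    # lookups; stale entries are lazily skipped when popping.

    _REMOVED = object()

    def __init__(self):
        self._heap = []
        self._entries = {}          # key -> [priority, tie, key]
        self._tie = count()         # tie-breaker keeps heap entries comparable

    def push(self, key, priority):
        if key in self._entries:    # update/replace: mark the old entry as removed
            self._entries[key][2] = self._REMOVED
        entry = [priority, next(self._tie), key]
        self._entries[key] = entry
        heapq.heappush(self._heap, entry)

    def pop(self):
        while self._heap:
            priority, _, key = heapq.heappop(self._heap)
            if key is not self._REMOVED:
                del self._entries[key]
                return key, priority
        raise KeyError("pop from an empty priority queue")

    def priority(self, key):
        return self._entries[key][0]

    def __contains__(self, key):
        return key in self._entries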

2.5 A* DNA Implementation

The abstract methods of the A* abstract class that need to be implemented are: the is_goal_state() method, the get_neighbors() method, the two methods for a state representation in the open set and the closed set, and the heuristic method.

2.5.1 Goal State Detection

The is_goal_state() method is used to determine whether a found state is a goal state. The method checks for the existence of the goal component in the vessel that represents the state. Various extensions of this method are possible. An added concentration requirement for the goal component in the goal vessel is one such idea for an extension. Another potentially interesting goal state definition might be the existence of more than just one component.

2.5.2 Construction of Neighbors

The get_neighbors() method gets the neighbors of a current vessel by making copies of the vessel and adding different fuel strands to them. The content is then stabilized and the neighboring vessels are returned. Which fuel strands to add is decided by looking at the strands that compose the goal component. Strands that exist in the goal component but are not present in the vessel are always interesting. In some processes there might be a need to remove strands that are present in the vessel. These strands can be removed by adding their inverse to the vessel. The inverse strand will then anneal to the unwanted strand and form a component with no unannealed segments. Such a component cannot affect any other components in the vessel and is therefore called a stable component.
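A sketch of this neighbor construction follows; the Vessel methods copy(), add() and stabilize() are hypothetical names used only for illustration.

def get_neighbors(vessel, fuel_strands):
    # For each candidate fuel strand: copy the vessel, add the strand,
    # stabilize the copy, and return the resulting vessels as neighboring states.
    neighbors = []
    for strand in fuel_strands:
        candidate = vessel.copy()
        candidate.add(strand)
        candidate.stabilize()
        neighbors.append(candidate)
    return neighbors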

2.5.3 Open Set Representation

There are a few ways to represent a vessel in the open set. To reduce the size of the state space, two vessels that are similar in content should compare equal (previously discussed in Section 2.4.3). The preferred way to do this is to select the most prevalent components that compose a certain percentage of the vessel, and let these components alone, without their quantities, define a vessel's open set representation.
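A sketch of such an open set key is shown below; the 80% coverage cut-off and the concentrations() method are assumptions made for this illustration.

def open_set_key(vessel, coverage=0.8):
    # Keep only the most prevalent components that together make up `coverage`
    # of the vessel content, and drop their quantities.
    conc = vessel.concentrations()              # dict: representation -> quantity
    total = sum(conc.values())
    key, covered = [], 0.0
    for rep, quantity in sorted(conc.items(), key=lambda item: -item[1]):
        key.append(rep)
        covered += quantity
        if covered >= coverage * total:
            break
    return tuple(key)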

2.5.4 Closed Set Representation

For the closed set representation of a vessel, it is important that two vessels that compare equal share the same outgoing neighbors, and the same neighbors' neighbors, and so on, until a goal vessel has been reached (previously discussed in Section 2.4.2). There is still potential for reducing the state space, however. If a low-cost path has been discovered to a vessel that is very similar to a vessel in the closed set, it might be worth taking the risk of replacing the more costly already visited vessel. To reduce the risk of creating inconsistent paths, all information about updated nodes in the closed set needs to be stored. This information is later used for a potential rollback operation if an inconsistency in a goal path is discovered.

2.5.5 Heuristic Function

There are two main ideas behind the heuristic function that estimates the remaining cost of travel from a vessel to a goal vessel. Since a goal vessel is defined by the presence of a goal component, component-to-component comparisons need to be done. The first idea is to compare the label counts of two components. For every label present in the two components, the number of segments with that label is counted for each component, and these two numbers are then compared.

The second idea for comparing two components is to compare the connectivity similarity of their segments. Two segments are compared by using their unique breadth-first component iterators. This process is very similar to that of segment comparison discussed in Section 2.1.2. At every iteration the two current segments are locally compared by their label and their closest neighbors. When two segments being visited compare unequal, iteration stops. The longer the iteration keeps going, the more similar the two segments are.
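The first idea, comparing label counts, might be sketched as follows; the .segments attribute and the summing of absolute differences are assumptions made here, one possible way of combining the per-label comparisons.

from collections import Counter

def label_count_difference(component, goal_component):
    # Count segments per label in both components and sum the differences.
    counts = Counter(seg.label for seg in component.segments)
    goal_counts = Counter(seg.label for seg in goal_component.segments)
    labels = set(counts) | set(goal_counts)
    return sum(abs(counts[label] - goal_counts[label]) for label in labels)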

2.6 Graphics

The graphics module for drawing components uses the open source graph visualization software Graphviz, and the Python interface PyGraphviz. It is easy to see the similarity between the structure of a component and the structure of a graph, where component segments translate to graph edges. But the task of translating a component into a graph is not straightforward. The connections in a component are really those of segments connected 3' end to 5' end, and segments connected anti-parallel by annealing to segments with inverse labels. In a sense it would be more correct to regard segments as nodes, and their connections to other segments as edges. This would, however, not be useful when using a graph generating tool to make graphical component representations.

Even when taking advantage of a graph drawing module, the difficulty remains of showing how two segments are annealed to each other. With a segment represented as an edge in a graph, how to represent connectivity between two edges is not evident. Using some of the features of the PyGraphviz interface, short invisible edges are added between two edges representing annealed segments, to bring them close together and anti-parallel. Figure 2.16 shows these invisible edges.

The Graphviz "spring model" layout, neato, does a good job of distributing the connected segments but a problem of edge crossings remain. An implementation of simulated annealing is used to minimize these edge crossings. An elementary move for the simulated annealing implementation consists of, for every component node (see Section 2.1.1) in the graph, the edge endpoints corresponding to the segment ends connected to the node, swapping coordinates. Figure 2.17 shows two examples graphics output, one output with minimizing edge crossing, and one output without minimizing edge crossings.

Chapter 3

Results

The 2193 lines of Python code written have produced various kinds of goal components, and provided the ordered list of fuel strands needed to build them. Among these are the tetrahedron, the cube, the stack, and the rickettsia motif. In the history of DNA nanotechnology various polyhedra have been created, hence the tetrahedron and the cube. The applications of the stack are explained in Jessica P. Chang's thesis [CHA01], and a description of the rickettsia motif is found in the article "An autonomous polymerization motor powered by DNA hybridization" [NNT03]. The final goal components for these A* runs are shown in Figure 3.1, Figure 3.2, Figure 3.3, and Figure 3.4. See Section A in the appendix for the entire goal paths leading to the goal components.

All experiments presented here have been carried out on a MacBook with a 2 GHz Intel Core 2 Duo. Given the nature of the described method used for simulating DNA, the performance of a particular calculation is sometimes presented as execution time and sometimes as iterations performed in a certain time frame. For some parameters, when stabilizing a vessel, iteration might never stop unless there is a limit on execution time. Comparing the execution times of two runs of stabilize that both reach the maximum execution time limit would have no meaning; instead, comparing the number of iterations performed in the given time might say something about performance.

Unique segment selection method | Components created | Time handling subsets | Time spent sorting | Average segments sorted
Sort all segments          | 10712 | 0%     | 64.27% | 17.71
No intersecting subsets    | 12962 | 36.82% | 10.68% | 5.11
Two intersecting subsets   | 12770 | 43.22% | 0.62%  | 1.29
Three intersecting subsets | 10890 | 49.34% | 0.17%  | 1.03

Table 3.1: Unique segment selection experiment for components [a0 a1] and [A0 A1].

Unique segment selection method | Components created | Time handling subsets | Time spent sorting | Average segments sorted
Sort all segments          | 6811 | 0%     | 69.91% | 29.01
No intersecting subsets    | 9460 | 41.48% | 6.54%  | 3.55
Two intersecting subsets   | 9023 | 48.08% | 0%     | 1
Three intersecting subsets | 6326 | 64.15% | 0%     | 1
Four intersecting subsets  | 3278 | 80.34% | 0%     | 1

Table 3.2: Unique segment selection experiment for components [a0 b0 c0 d0] and [A0 B0 C0 D0].

3.1 DNA Simulation

3.1.1 Unique Segment Selection

The majority of execution time when stabilizing a vessel is spent creating unique representations for newly created components. A unique representation is obtained by uniquely selecting one segment of a component, and then using that segment's unique iterator to iterate over all its connected neighbors (previously discussed in Section 2.2.1). Uniquely selecting one segment of a component by sorting all segments is costly. The number of segments to be sorted is therefore screened by creating subsets with segments of certain properties, and by creating even more, smaller subsets from the intersections of these subsets.

Experiments were carried out to investigate the effectiveness of the segment screening method. One experiment was made by stabilizing a vessel with components [a0 a1] and [A0 A1] (Table 3.1), and one by stabilizing components [a0 b0 c0 d0] and [A0 B0 C0 D0] (Table 3.2). A vessel was created six times and allowed to stabilize for 10 seconds. Presented in these tables are: the total number of components created during this time, the time spent creating and organizing subsets of segments, the time spent sorting segments, and the average number of segments sorted each time a set of segments was sorted.

Unique segment selection method | Execution time | Time handling subsets | Time sorting | Segments sorted
Sort all segments          | 102.35 | 0%     | 86.17% | 24
No intersecting subsets    | 87.58  | 46.47% | 22.37% | 8
Two intersecting subsets   | 83.80  | 58.36% | 8.18%  | 4
Three intersecting subsets | 88.83  | 66.55% | 2.08%  | 2

Table 3.3: Unique representation experiment for a complex component with many similar segments (shown in the appendix).

Two experiments were carried out by only creating the string representation of two different components 10,000 times. In the first experiment a complex component, chosen for its similarity of segments, was used (Table 3.3). The component used is shown in the appendix.

In the second experiment a component with the sequence [a0... a1A1 | A0... A1a1] repeated 100 times was used. Figure 3.5 shows the pattern of the four different segments that is repeated in this component. Table 3.4 holds the results for the experiment. Data presented in these tables are: the execution time for making unique representations 10,000 times, the time spent creating and organizing subsets of segments, the time spent sorting segments, and the number of segments sorted every time a component representation was created.

Unique segment selection method | Execution time | Time handling subsets | Time sorting | Segments sorted
Sort all segments          | 114.77 | 0%     | 99.47% | 400
No intersecting subsets    | 1.10   | 32.96% | 0.04%  | 2
Two intersecting subsets   | 1.14   | 31.36% | 0%     | 1
Three intersecting subsets | 1.15   | 32.33% | 0%     | 1

Table 3.4: Unique representation experiment for the repeated-pattern component in Figure 3.5.

Minimum concentration level | Execution time | Stabilize iterations | Components created
10^-2  | 0.15 | 8  | 38
10^-3  | 1.47 | 11 | 172
10^-4  | 7.16 | 14 | 328
10^-5  | 30   | 12 | 886
10^-10 | 30   | 6  | 1381

Table 3.5: Stabilizing a vessel with components [a0 b0] and [A0 B0] at different minimum-component-concentration levels.

3.1.2 Minimum-Component-Concentration

An attempt to stabilize a vessel can often lead to a potentially infinite number of resulting components (previously discussed in Section 2.3.2). A minimum-component-concentration level is used to decide which components in a vessel can be ignored. This level is also used to determine when a vessel is considered stable, and it greatly affects the time needed to stabilize a vessel.

An experiment was carried out to examine the effect of different minimum-component-concentration levels. Testing was performed by stabilizing a vessel with the components [a0 b0] and [A0 B0], with a maximum time of 30 seconds to stabilize. Presented in Table 3.5 are: the stabilize execution time, the number of stabilize iterations, and the number of components created.

All runs with different minimum-component-concentration levels end up with the same two most common components at approximately the same concentrations. Interesting to note is that the 3rd and 4th components of highest concentration, for a minimum-component-concentration level of 10^-10, are the original [a0 b0] and [A0 B0] components.

3.1.3 Intra- and Inter-Component-Anneals

In a situation where a component in a vessel can anneal either to itself or to another component in the vessel, parameters for stabilize are used to decide which is more likely (previously discussed in Section 2.3.3). The likelihood of intra-component-anneals versus inter-component-anneals has a great impact on the concentrations of the resulting components. An experiment was carried out to stabilize a vessel containing components [a0 b0] and [A0 B0] for a maximum of 180 seconds, with different parameters for the probability of intra-component-anneals and inter-component-anneals.

Table 3.6 shows the proportion of the vessel content that will perform intra-component-anneals and inter-component-anneals. There is data for the execution time needed to stabilize the vessel contents, the number of stabilize iterations, and the number of components created.

Also shown is the quantity of the components shown in Figure 3.6 and Figure 3.7. These components are referred to as component A and component B in the table.

Intra-/Inter proportion | Execution time | Stabilize iterations | Components created | Comp. A | Comp. B
0.9 / 0.1   | 1.01   | 28 | 160  | 0.88 | 0.04
0.45 / 0.05 | 2.11   | 87 | 174  | 0.87 | 0.05
0.8 / 0.2   | 3.17   | 21 | 406  | 0.75 | 0.07
0.4 / 0.1   | 4.88   | 52 | 458  | 0.74 | 0.08
0.5 / 0.5   | 96.78  | 17 | 4466 | 0.37 | 0.06
0.25 / 0.25 | 102.85 | 37 | 4480 | 0.40 | 0.08
0.2 / 0.8   | 180    | 9  | 6762 | 0.10 | 0.01
0.1 / 0.4   | 180    | 19 | 6713 | 0.13 | 0.03

Table 3.6: Stabilizing [a0 b0] and [A0 B0] with different intra-/inter-component-anneal proportions.

Stabilize iteration | a0   | A0   | [a0A0 | A0a0]
0 | 1.00 | 1.00 | 0.00
1 | 0.50 | 0.50 | 0.50
2 | 0.25 | 0.25 | 0.75
3 | 0.12 | 0.12 | 0.87
4 | 0.06 | 0.06 | 0.93
5 | 0.03 | 0.03 | 0.96

Table 3.7: Concentrations of [a0], [A0], and [a0A0 | A0a0] at each stabilize iteration.

3.1.4 Transition-Weight-Problem

In an experiment, a vessel originally containing only components [a0] and [A0] was stabilized. Table 3.7 shows the concentration of [a0], [A0], and [a0A0 | A0a0] at each iteration of stabilize. Stabilize was run with no null-transitions, i.e. the specified proportions of intra-component-anneals and inter-component-anneals add up to one. With a minimum-component-concentration of 10^-2 the stabilize process stops after 8 iterations, with 10^-5 after 18 iterations, and with 10^-10 after 35 iterations.

Average estimation to target ratio | Execution time | Returned path to optimum path ratio
0.54  | 551.38 | 1.00
1.00  | 161.84 | 1.00
10.05 | 18.23  | 1.00
15.53 | 15.97  | 1.16

Table 3.8: A* runs with underestimating and overestimating heuristics, using the cube component as goal.

3.2 A*

3.2.1 Overestimation and Underestimation

An experiment was carried out to examine the effects of overestimation and underestimation with the A* algorithm (previously discussed in Section 2.4.4). The result from the heuristic function was multiplied by a constant in order to achieve values below and above the target value. The goal component used for the experiment was the cube component shown in Figure 3.2. For every step on the goal path, from start node to goal node, the ratio of estimated-cost-to-goal to target-cost-to-goal was taken into consideration. The average of this number is used for the first column in Table 3.8, along with execution time and the closeness of the returned goal path to the optimum goal path.

Open set representation | Execution time | Nodes expanded | Nodes added | Nodes updated | Nodes dropped
100% | 1519.98 | 1462 | 56.22% | 0%    | 43.77%
80%  | 482.51  | 640  | 39.68% | 0%    | 60.31%
50%  | 346.81  | 548  | 29.92% | 1.09% | 68.97%
20%  | 313.56  | 518  | 26.06% | 1.54% | 72.39%
1%   | 311.28  | 518  | 26.06% | 1.54% | 72.39%

Table 3.9: A* runs with the cube goal component for different open set state representations.

3.2.2 Generality of Open Set State Representation

An experiment was carried out to test the effects of different open set state definitions (previously discussed in Section 2.4.3). To avoid inconsistent goal paths, where a state in the closed set has been replaced by a state that does not share all its neighbors, the unique vessel representation was used as the closed set state definition. The unique vessel representation is constructed by listing all present components and their concentrations, ordered by concentration. In the experiment the open set representation for a vessel was made by taking only the most prevalent components, ordered by concentration, and ignoring their concentrations.

In Table 3.9 different A* runs with the cube component as the goal component are presented. The percentage in the first column indicates the percentage of the vessel content considered when finding the most prevalent components. Also presented are the execution time, the number of nodes expanded, and the percentage of these expanded nodes that were added to the open set, that updated the open set, and that were dropped. All experiments were run with a heuristic function returning a 1.0 average estimate-to-target ratio, and all runs returned the optimum path.

3.2.3 Heuristic Function Deviation from Target

Shown in Figure 3.8 is the plot of estimated-cost-to-goal against target-cost-to-goal. This was done for all states on all goal paths leading to the tetrahedron, the cube, the stack, and the rickettsia motif. The heuristic function was previously discussed in Section 2.5.5.

Chapter 4

Discussion

4.1 DNA Simulation

4.1.1 Unique Segment Selection

From the data presented in Section 3.1.1, it can be seen that much time can be saved if sorting all segments is avoided. In Table 3.1 and Table 3.2 the number of components created shows how many components the stabilize process created during the 60-second limit. It is clear that not screening any segments at all is slower than performing some form of screening. There also seems to be a limit, however, to how many subsets it is optimal to create.

With an increasing number of subsets used for producing new subsets by taking their intersections, the number of available combinations increases exponentially. The computational burden of creating new subsets therefore soon overtakes the burden of sorting segments. Table 3.2 shows that already after taking the intersection of only two of the original subsets, a unique segment is found and no sorting needs to be performed.

For components with many similar segments, many subsets need to be created to bring down the number of segments to be sorted. Table 3.3 illustrates this: there is still no unique segment even when three properties are specified. From the table we see that there must be at least 8 segments with different labels, 8 segments that are annealed, 8 unannealed segments, 8 segments that are ligated at the 3' end, 8 segments not ligated at the 3' end, and so on. Even in this example, however, at some point the computational burden of creating subsets exceeds the execution time saved by reducing the number of segments to sort.

Sorting many segments can be costly when the segments are very similar and comparing them takes a long time. The effects of this can be seen in the example shown in Table 3.4. When comparing two segments in the middle of the 400-segment-long component, almost the entire component might need to be traversed. Take two segments like [a1] in Figure 3.5: when iterating using their breadth-first iterators, the visited segments will compare equal locally for a very long time. From the table it is clear that some screening can greatly speed up the process of uniquely selecting a segment in a component.
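As a rough analogue of this screening idea (not the project's subset-intersection code; the property functions and data layout are assumptions), cheap properties can be used to narrow the candidates before any expensive sorting:

```python
def select_unique_segment(segments, properties):
    """Screen segments by cheap properties before falling back to sorting.

    `segments` is a collection of segment objects and `properties` a list
    of functions mapping a segment to a hashable value (e.g. its label,
    whether it is annealed, whether it is ligated at the 3' end).
    Each property partitions the candidates; as soon as some candidate is
    alone in its partition it is unique and no sorting is needed.
    (Hypothetical sketch -- illustrative only.)
    """
    candidates = list(segments)
    for prop in properties:
        buckets = {}
        for seg in candidates:
            buckets.setdefault(prop(seg), []).append(seg)
        singletons = [group[0] for group in buckets.values() if len(group) == 1]
        if singletons:
            return singletons[0]   # screening found a unique segment
        # keep only the smallest bucket to narrow the expensive comparison
        candidates = min(buckets.values(), key=len)
    # no property made a segment unique: fall back to the costly full sort
    return sorted(candidates)[0]

# Tiny usage example with tuples standing in for segments:
segs = [("a0", True), ("a0", False), ("b1", True), ("a0", True)]
label = lambda s: s[0]
annealed = lambda s: s[1]
print(select_unique_segment(segs, [label, annealed]))   # ('b1', True)
```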

4.1.2 Minimum-Component-Concentration

From the data presented in Section 3.1.2, the conclusion can be drawn that it is quicker to stabilize a vessel when more components are ignored. Table 3.5 also clearly shows that more components are created when fewer components are ignored during stabilize. The reason is that when fewer components are ignored, more components are allowed to transition into new components. Given that creating new components can be a slow process, creating more components means longer execution times for stabilizing.
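A minimal sketch of what ignoring components means, assuming a vessel is simply a mapping from components to concentrations:

```python
def active_components(vessel, min_conc):
    """Return only the components whose concentration is at or above the
    minimum-component-concentration level; everything below it is ignored
    by stabilize.  (Illustrative only -- the data layout is an assumption.)"""
    return {comp: conc for comp, conc in vessel.items() if conc >= min_conc}

vessel = {"[a0 b0]": 0.40, "[A0 B0]": 0.40, "[a0A0 | A0a0]": 3e-6}
print(active_components(vessel, 1e-5))   # the 3e-6 component is ignored
print(active_components(vessel, 1e-10))  # all three components are kept
```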

When more components are considered, a larger number of stabilize iterations is also needed before the vessel is stable. A maximum allowed execution time for stabilize is therefore often necessary. When stabilize is terminated early, the concentrations of the vessel contents are still changing between iterations by more than the minimum-component-concentration level. In the event that the maximum execution time is reached, the resulting content of a vessel with a higher minimum-component-concentration level will therefore be more unstable.

Also, when fewer components are ignored, one iteration of stabilize takes longer. With a maximum stabilize execution time, this means fewer iterations. With every iteration, more and more of a particular component is used up by annealing to other components. The problem with fewer iterations is that more of the original content will remain. This was also observed in the experiment, where the 3rd and 4th most prevalent components after a stabilization with a minimum-component-concentration level of 10^-10 were the original components.

4.1.3 Intra- and Inter-Component-Anneals

The results from the experiment performed to investigate the effects of different proportions of intra- and inter-component-anneals are shown in Table 3.6, in Section 3.1.3. In the table, experiments go from a high probability of intra-component-anneals and a low probability of inter-component-anneals, to the reverse. Every experiment is paired with another experiment where half of all components are left untouched between every iteration of stabilize.

The general trend is that a higher proportion of intra-component-anneals produces a larger quantity of component A, while a higher proportion of inter-component-anneals gives a higher concentration of component B. The first thing that happens in a vessel with components [a0 b0] and [A0 B0] is that either segment [a] anneals to [A], or segment [b] anneals to [B]. Once such a component is formed, it can either close up and create the component called A in the table, or anneal to other components, making larger structures. Once component A is formed, however, there are no unannealed segments left and it cannot transition into any other component; it is a so-called stable component.

If the component does not close up and form component A, it might anneal to another copy of itself and then close up, creating the also-stable component B. This component can also be created in other ways, but the important thing to note is that it takes a greater number of anneals between components, i.e. more inter-component-anneals, to create one. This is why a larger proportion of intra-component-anneals tends to produce a higher concentration of smaller stable components, and conversely a higher proportion of inter-component-anneals produces a higher concentration of larger components.

If stable components are formed at a higher rate with a higher proportion of intra-component-anneals, fewer components go on to create unstable longer components that grow with every iteration of stabilize. With a higher rate of stable components being formed, fewer components are created overall, and the stabilize execution time is lower.

The reverse happens with more inter-component-anneals: execution times tend to get longer, more components are created, but fewer stabilize iterations are run. With a greater variety of components come lower concentrations. This is why a decrease in the number of stabilize iterations can be seen with more inter-component-anneals. As component concentrations are quickly reduced by every iteration, the component concentration levels, or the rate of concentration change, soon fall below the minimum-component-concentration level, and the vessel content is declared stable in fewer iterations.

From Table 3.6 it can also be seen that the effect of introducing null-transitions is longer execution times, more stabilize iterations, and a few more components created. The larger number of created components is a result of more stabilize iterations, and the larger number of stabilize iterations is a result of leaving half of all components unaffected every iteration. With untouched components left behind each iteration, the rate at which they are diluted is slower, and it therefore takes longer for components to fall below the minimum-component-concentration level. It is interesting to note that the number of stabilize iterations approximately doubles when half of all components are left untouched each iteration.

4.1.4 Transition-Weight-Problem

When stabilizing a vessel containing only component [a0] and component [A0] there is only one possible product: the only component that can remain is [a0A0 | A0a0]. At every iteration of stabilize, the fate of each component is decided. The only thing that can happen to [a0] is to anneal to [A0], and the only thing that can happen to [A0] is to anneal to [a0]. With no specified null-transitions, all of [a0] and [A0] should therefore anneal in a single iteration. In Table 3.7 it can be seen that only half of the quantity of [a0] and [A0] anneals at every iteration.

This is the result of the imperfect but quick solution to the transition-weight-problem described in Section 2.3.4. The execution of stabilize eventually comes to a stop when the concentration change of all components falls below the minimum-component-concentration level. Solving the transition-weight-problem with a system of equations would quickly become very costly with more components to stabilize. Another argument for avoiding the approach of solving a system of equations is the importance of allowing for some null-transitions, as discussed in Section 2.3.3.

4.2 A*

4.2.1 Overestimation and Underestimation

An overestimating heuristic used with the A* algorithm will tend to perform a search more similar to a depth-first search, and a non-admissible heuristic might return a non-optimum path. An underestimating heuristic will conversely perform a search more similar to a breadth-first search. This hypothesis is presented in Section 2.4.4 and is supported by the data in Table 3.8. It can be read from the table that an underestimating heuristic takes longer to finish execution, while an overestimating heuristic is quicker but might not return the optimum path.

It is worth noting that when using a heuristic function that on average returns a cost of ten times the target cost, the optimum path is still returned. The execution time for this overestimating heuristic is also approximately one tenth of the execution time for the heuristic that on average makes perfect estimations.

4.2.2 Generality of Open Set State Representation

The data in Table 3.9 shows that the more general the definition of the open set state representation, the sooner the A* algorithm terminates. This suggests that only the most prevalent components in a vessel are important to consider when selecting vessels that might be the next step on the optimum path to the goal. Furthermore, it can be concluded that the exact concentrations of components in a vessel are not important to the open set, and that too specific a vessel representation only leads unnecessarily to a greater number of vessels being examined.

From the last two rows in Table 3.9 it is fair to assume that the two corresponding experiments produced the same, or at least very similar, open set state representations, as the resulting data is nearly identical. The difference between the two experiment runs is that in the first run the most prevalent components in 20% of the vessel were considered, and in the second only 1%. The conclusion to be drawn here is that there was probably always one component making up 20% of the vessel contents, which would have resulted in identical open set representations for the two experiment runs.

One could imagine a problem arising if the open set state representation were too general, however. Important vessels, containing just the right few components and well on their way to producing a goal component, might be dropped. This is not seen in the presented experiment, where it seems that only the single most prevalent component is important, but in a situation where two parts of a goal component are built in parallel, more than just one component should be of interest.

4.2.3 Heuristic Function Deviation from Target

What can be said about the graph in Figure 3.8 is that the deviation of the estimated cost from the target cost is fairly small. It has been shown that, for the case of using the A* algorithm for building DNA structures, it is not vital to have an admissible heuristic function. With a non-admissible heuristic, however, it becomes even more important to have a heuristic that does not make estimations of vastly varying quality. When optimality cannot be guaranteed, a greater deviation from the target value means a greater risk of falling into a local minimum.

Chapter 5

Conclusion

5.1 Accomplishments

The software developed in this project to sequence the construction of desired DNA structures has been successful in its task. The only limiting factor is the size of the goal component. For every additional strand in the goal component, the number of neighbors expanded increases exponentially with an underestimating heuristic. With an overestimating heuristic, the risk of falling into a local minimum becomes more severe with additional strands.

As presented in Chapter 4, the underlying theory behind some of the implementation complications described in Chapter 2 is confirmed. The complications of introducing a screening process for selecting a unique segment are clearly justified. The modification of the A* algorithm to allow for different node representations in the open set and the closed set has also proven to be very important.

5.2 Extensions

5.2.1 DNA Simulation

In this project, the simulation of DNA is based on the abstraction of DNA discussed in Section 1.3. There are plenty of known (and unknown) variables that need to be taken into account when attempting to make a simulation as accurate as possible. Some of these variables include the pH and the temperature of the test tube contents. Variables such as the "bendiness" of simplex DNA and of duplex DNA also play a big role in simulation. The so-called honeycomb lattice, made using DNA origami to form bundles of six duplex strands [NAN02], is designed especially to create a rigid structure. The "bendiness" of this structure can in no way be compared to that of a strand of simplex DNA.

All these variables that decide the behavior of DNA have been ignored in order to allow for quick simulations. An extension of the software might, however, make better compromises, taking more of these variables into account.

The problem of creating too many null-transitions with the current solution to the transition-weight-problem is regarded as the problem area with the largest room for improvement. It is regrettable that the approach of solving a system of equations has not yet been implemented and tested. Though in theory this approach would not contribute to any improvement in efficiency, it might still be interesting to see what could be accomplished.

Another issue with how the transition-weight-problem is solved is that the probability that two components anneal depends only on their concentrations. It would seem more realistic that a component would be more likely to anneal to another component with a thousand segments available for annealing than to a component with only one available segment. For example, suppose components A, B, and C all have the same concentration in a vessel, A can anneal to B via a thousand unannealed segments on both components, and A can anneal to C via only a single segment on each component, the two being each other's inverses. Component A would then be more likely to anneal to component B, and less likely to anneal to component C.
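One way such an extension could weight anneals is sketched below; the formula and interface are proposals for illustration, not part of the implemented software.

```python
def anneal_weight(conc_a, conc_b, matching_segments):
    """Proposed extension: weight the probability that two components
    anneal not only by their concentrations but also by the number of
    complementary segment pairs available between them.
    (Sketch of the suggested extension only.)"""
    return conc_a * conc_b * matching_segments

# With equal concentrations, annealing A to B (1000 matching segments)
# becomes far more likely than annealing A to C (1 matching segment):
w_ab = anneal_weight(0.1, 0.1, 1000)
w_ac = anneal_weight(0.1, 0.1, 1)
total = w_ab + w_ac
print(f"P(A-B) = {w_ab / total:.4f}, P(A-C) = {w_ac / total:.4f}")
```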

A possible extension of the vessel stabilize algorithm is to have a self-correcting minimum-component-concentration level. If the user wishes to maximize the precision of the simulation within a given time frame and the value for minimum-component-concentration is too high, the stabilize algorithm will terminate early. If the value is too low, precision is lost because the stabilize algorithm is interrupted and fewer iterations of stabilize are performed. With a vessel that keeps track of this data, the value for the minimum-component-concentration level could be adjusted automatically.
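A self-correcting level could, for example, be adjusted after each stabilize run roughly as follows; the factor of ten and the interface are illustrative assumptions.

```python
def adjust_min_concentration(level, converged, time_used, time_limit):
    """Self-correcting minimum-component-concentration level (proposed
    extension; the adjustment factor and this interface are assumptions).

    If stabilize converged well before the time limit, the level was too
    high and can be lowered for more precision next run; if stabilize had
    to be interrupted, the level was too low and is raised.
    """
    if converged and time_used < time_limit:
        return level / 10.0   # converged early: allow more precision next run
    if not converged:
        return level * 10.0   # interrupted: use a coarser level next run
    return level

# Example: a run that converged after 40 s of a 180 s budget
print(adjust_min_concentration(1e-5, converged=True, time_used=40, time_limit=180))
```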

5.2.2 A*

Attempts to create an admissible heuristic could be made. As previously argued in Section 2.4.4, optimality might not necessarily be as interesting as keeping execution times low. Although the heuristic with an average estimation-to-target ratio of 0.54 used in the experiment presented in Table 3.8 is not theoretically an admissible heuristic, it is still fair to assume that no overestimations were ever made. This is supported by the graph in Figure 3.8, which shows that the deviation of the estimated values from the target values is in general low. It is therefore fair to assume that the estimations in the aforementioned experiment were fairly steadily half of the target values. From Table 3.8, showing the results of the overestimating and underestimating experiments, the trend very clearly seems to be that execution time is greatly increased with more underestimation.

Extensions in terms of what the A* algorithm can produce have also been discussed. Suggestions have included defining a goal node to be a goal shape rather than a goal component. Instead of specifying a specific component, only the connectivity could be specified, without including the actual labels of the segments.

Another alternative way of specifying a goal node is to include more than just one component in the definition. It might be realistic to ask for the production of two components in parallel if they are very similar, and it might turn out to be beneficial to create components in parallel rather than one at a time.

Bibliography

[CHA01] J. P. Chang, Topological Modelling of DNA nanomachines, New York University, Courant Institute of Mathematical Sciences, New York, NY, 2010

[ROT01] P. W. K. Rothemund, Scaffolded DNA origami: from generalized multi-crossovers to polygonal networks, California Institute of Technology, Pasadena, CA 91125, 2006

[NNT01] J. Bath and A. J. Turberfield, DNA nanomachines, Nature Nanotechnology, 2 (5), pp 275-284, 2007

[NNT02] Y. Ishitsuka and T. Ha, A nanomachine goes live, Nature Nanotechnology, 4 (5), pp 281-282, 2009

[NNT03] S. Venkataraman, R. M. Dirks, P. W. K. Rothemund, E. Winfree, and N. A. Pierce, An autonomous polymerization motor powered by DNA hybridization, Nature Nanotechnology, 2 (8), pp 490-494, 2007

[ACS01] Y. Zhang and N. C. Seeman, Construction of a DNA-Truncated Octahedron, American Chemical Society, 116, pp 1661-1669, 1993

[NAN01] W. B. Sherman and N. C. Seeman, A Precisely Controlled DNA Biped Walking Device, Nano Letters, 4 (7), pp 1203-1207, 2004

[NAN02] F. Mathieu et al., Six-Helix Bundles Designed from DNA, Nano Letters, 5 (4), pp 661-665, 2005

[NAR01] M. Zuker, Mfold web server for nucleic acid folding and hybridization prediction, Nucleic Acids Research, 31 (13), pp 3406-3415, 2003

[NAT01] E. S. Andersen, et al., Self-assembly of a nanoscale DNA box with a controllable lid, Nature, 459 (5), pp 73-76, 2009

[NAT02] P. W. K. Rothemund, Folding DNA to create nanoscale shapes and patterns, Nature, 440 (3), pp 297-302, 2006

[GEN01] F. W. Stahl, The Holliday Junction on Its Thirtieth Anniversary, Genetics, 138 (10), pp 241-246, 1994

[IEE01] P. E. Hart, N. J. Nilsson, B. Raphael, A Formal Basis for the Heuristic Determination of Minimum Cost Paths, IEEE Transactions on Systems Science and Cybernetics, 4 (2), pp 100-107, 1968

[NUM01] E. W. Dijkstra, A Note on Two Problems in Connexion with Graphs, Numerische Mathematik, 1, pp 269-271, 1959

[OBJ01] R. C. Martin, Design Principles and Design Patterns, www.objectmentor.com, 2000

[NHG01] National Human Genome Research Institute, Deoxyribonucleic Acid (DNA), http://www.genome.gov/25520880

[NEN01] Nanorex Inc., NanoEngineer-1, http://www.nanoengineer-1.com

[CAD01] Shawn Douglas, caDNAno, http://cadnano.org

[NUP01] Niles A. Pierce, NUPACK, http://nupack.org

[SAR01] Centre for DNA Nanotechnology, SARSE, http://sarse.ku.dk

[GID01] Subirac, GIDEON, http://www.subirac.com

[ALG01] J. Kleinberg and É. Tardos, Algorithm Design, Pearson Education, 2005, ISBN 0321372913

[PUZ01] J. Slocum and D. Sonneveld, The 15 Puzzle, Slocum Puzzle Foundation, 2006, ISBN 1890980153