Literature Survey About Phylogenetic Trees And Reconstruction Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Phylogeny is the evolutionary development of organisms. It describes how species are related and evolutionary relatedness among various types of organisms. Phylogeny shows the ancestral relationship among species both the living and extinct, so the connections between all groups of organisms as understood by ancestor/descendant relationships [1].

2.1.2 Phylogenetics

Phylogenetics is the study of evolutionary relatedness among various groups of organisms, which is discovered through molecular sequencing data and morphological data matrices. For bioinformatics analysis such as construction of sequence alignment, getting the phylogenetic relationship between two or more sequences is often extremely important information. Using phylogenetic analysis the evolutionary history and the relationships between entire species can be understood. The only way to determine phylogenetic relationship between certain species is phylogenetic analysis. Because of the rapid growth of DNA and protein sequence data, phylogetic analysis is very important to understand the evolutionary patterns and understand the originated category of those sequence data.

2.1.3 Phylogenetic Trees

The main output of phylogeny is the phylogenetic tree which is the graphical representation of the evolutionary relationship among taxonomic units. The system of classifying plants and animals by grouping them into categories according to their similarities is called as taxonomy [2]. Phylogenies typically define only those events that are relevant to a set of living taxa, there is no matter whether these taxa correspond to species, sets of species, individual organisms, or even particular genes [3].

Phylogenetic tree is an evolutionary tree. Furthermore, if anyone needs to understand the theory of evolution by evidence support, he should understand the phylogeny because these trees show descent from a common ancestor, and because much evidence for evolution comes in the form of common ancestry [3].

In most cases, researchers draw phylogenetic trees to record only the events that are relevant to a set of living taxa. Most commonly, these taxa are species [3]. Figure 1 shows the basic tree which represents the history of the four "tip" species, A to D. This tree shows that species A and B share a more recent common ancestor with each other, than with either species C or species D. Likewise, species C and D share a more recent common ancestor with each other, than with either species A or species B. This example illustrates the phylogenetic tree in most basic level with history of descent from common ancestry.


Branching pattern of 4 species

According to the above figure, Phylogenetic tree consists of branches and edges. Natural chunks of trees are called as Clades. In the clades, there is a portion of history (specifically, the internal branch that attaches the clade to the rest of the tree) that is common to all members of the clade and but not common to other clades in the tree. As a result, if statements of common ancestry are applied to a clade, it should be always applied to all the tips within the clade [3].

2.1.4 Terminology of Tree Reconstruction Dendrogram view of phylogenetic tree

This is a diagrammatic representation of a phylogenetic tree. phylogenetic trees are made by arranging nodes and branches. Every node represents a distinct taxonomical unit. Terminal nodes relate to a gene or organism for which data sequence exists. Inferred common ancestor, which is represented by internal nodes, has given rise to two independent lineages at some point in the past, for which empirical data are no longer available. Normally computer programs produce the tree information using a series of nested parenthesis called as Newick format. The Newick format of figure 2 can be represent as (((I,II),(III,IV)),V).

Figure 2

Dendogram view

If phylogenetic tree has only two lineages that are descended from it, it can be said as bifurcating. But it is also possible to have a multifurcating trees with three or more descendent lineages. Also branching pattern of phylogenetic trees can be used to convey the information about which evolution has occurred. Also the length of branches can be used to indicate distance between diverged data sets. There are 2 different types of trees related to branch length.

1. Scaled tree

In scaled trees, branch lengths are proportional to the no of evolutionary changes or distance. In the best case, scaled trees are also additive, accumulated differences of nodes are represented by the physical length of the branches connecting any two nodes.

2. Unscaled tree

In unscaled trees, all branches in the tree are of the same length. It lineup all terminal nodes and convey only their relative kinship without representing any of changes that separate them such as distance [4]. Cladogram view of phylogenetic tree

A cladogram is a tree which is formed using cladistic methods .Cladograms are also branched diagrams which represent only branching patterns, similar in appearance to family trees, that illustrate patterns of relatedness where the branch lengths are not necessarily proportional to the evolutionary time between related organisms or sequences. A phylogenetic tree is a specific type of cladogram where the branch lengths are proportional to the predicted or hypothetical evolutionary time between organisms or sequences. Cladograms can be produced to representing relationships between sequences, either DNA sequences or amino acid sequences. However, cladograms can rely on many types of data to show the relatedness of species [5].

The pattern of relatedness in any cladogram is illustrated using possible evolutionary pathways. There are many evolutionary pathways in cladograms. Because of that Cladograms cannot be considered as completely better and accurate descriptions of the evolutionary history of organisms. It does not necessarily clarify the pathway that created the existing relationships in the cladograms. It only illustrates the probability that two organisms, or sequences, are more closely related to each other than to a third organism [5].

Figure 3

Cladogram View

A chronogram is a phylogenetic tree that explicitly represents evolutionary time through its branch lengths [6]. Rooted and Unrooted Trees

Rooted tree is a directed tree with a unique node corresponding to the most recent common ancestor of all the nodes at the leaves of the tree. The most common method for rooting trees is the use of an uncontroversial out-group which is close enough to allow inference from sequence or trait data, but far enough to be a clear out-group[6].

Unrooted trees are not representing the information about ancestry at all. It only illustrates the relatedness of the leaf nodes. Although unrooted trees can always be generated from rooted ones by simply eliminating the root, without some means of identifying ancestry it is difficult to understand a root using an unrooted tree. This process is normally done by including an out-group in the input data or introducing additional assumptions about the relative rates of evolution on each branch [6].

Both rooted and unrooted phylogenetic trees can be either bifurcating or multifurcating, and either labeled or unlabeled.

Figure 4

Rooted and Unrooted tree Importance of phylogenetic trees

Understand the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical and/or genetic characteristics.

Trees provide an efficient structure for organizing knowledge of biodiversity and allow one to develop an accurate, no progressive conception of the totality of evolutionary history.

Phylogenies are useful for providing insight into events that occurred during evolution.

To understand the ancestral category of unknown sequence data either DNA sequence data, Protein sequence data. Limitations of phylogenetic trees

Although the phylogenetic trees are very important part of molecular biology, it has some limitations such as,

If it use noisy data for phylogenetic analysis , then analysis can be confounded by horizontal gene transfer[7], hybridisation between species that were not nearest neighbors on the tree before hybridisation takes place, convergent evolution, and conserved sequences [6].

Because of there are some method to generate phylogenetic trees, it is difficult to select optimal method to generate optimal phylogenetic tree. It has to consider several criteria before constructing phylogenetic tree such as [6],

Efficiency -how long does it take to compute the answer, how much memory does it need?

Power -does it make good use of the data, or is information being wasted?

Consistency -will it converge on the same answer repeatedly, if each time given different data for the same model problem?

Robustness -does it cope well with violations of the assumptions of the underlying model?

Falsifiability -does it alert us when it is not good to use, i.e. when assumptions are violated?

So have to use heuristic search and optimization methods combination with tree-scoring functions to identify a reasonably good tree that fits the data.

Also, there are problems in basing the analysis on a single type of character because such trees constructed from another unrelated data source often differ from the first and therefore great care is needed in inferring phylogenetic relationships among species [6]. Those single types of characters are like single gene or protein or only on morphological analysis [6].

One should be knowledgeable in that area to select the necessary sequence for the phylogenetic analysis.

Understanding the output of phylogenetic analysis is also critical; because of many published phylogenetic trees hide their implementation details.

Quality of phylogenetic tree is depending on quality of data and multiple sequence alignments.

The multiple sequence alignment should be done and create the complete phylogenetic trees to understand the originated category of every sequence data to find which category is unknown. It requires more computational time.

2.2 Steps of Phylogenetic Tree Construction

Do the multiple sequence alignment for selected sequences

Generate phylogenetic tree using output of multiple sequence alignment

2.2.1 Multiple Sequence Alignments

Sequence alignment is a way of arranging the sequences of DNA, RNA or protein to identify or highlight regions of similarity (conserved sequence regions) that may be a result of functional, structural or evolutionary relationships between the sequences.

If two sequences in an alignment share a common ancestor sequences then mismatches can be interpreted as single base substitution and gaps (insertion or deletion mutations) introduced in both lineages in the time since they diverged from one another. Either the similarity approach which aims to maximize the number of identically matched nucleotides in the two sequences or distance approach which aims to minimize the number of mismatches will often identify the same alignment as the best one.

The human knowledge on sequence alignment is applied and has constructed high quality sequence alignment algorithms because we cannot align lengthy, highly variable sequences using human effort. Multiple sequence alignment is aligning more than two sequences at a time. The basic information from a multiple alignment of nucleotide sequences is the position and the nature of the conserved regions in each member of the group. Conserved sequence regions correspond to functionally and structurally important parts of the sequence. Sequence similarity can be reflected using multiple sequence alignment.

There are some automated tools to automatically get the multiple sequence alignment. Some of those software tools are,

Clustal W

Today this is the most popular multiple alignment program. W stands for weighted. It is a fully automatic program that produce biologically meaningful global multiple sequence alignment of divergent sequences [8]. The fact that is used in ClustalW is that the homologous sequences are evolutionarily related [8]. ClustalW uses the progressive approach to carryout multiple alignments. It has three step processes that are [8].

(a) Construct Pair wise alignment

(b) Build guide tree

(c) Progressive alignment guided by tree

Clustal W first aligns the most similar sequences and gradually adds in the distant ones [8]. This approach works well when the data set contains sequences of different degree of divergence [8]. When all the sequences are highly divergent the progressive approach is less reliable [8]. Clustal W use sequence weighting, position specific gap penalties and weight matrix choice to advance the sensitivity of progressive multiple sequence alignment [8]. There exists both online version and installable version freely available.

Link to Tool:


Provides a window-based user interface to the Clustal-W multiple alignment program [9].

2.2.2 Computational Approaches for Phylogenetic Tree Construction

Phylogenetic tree can be generated using some methods by using the output of multiple sequence alignment. To date there is no perfect method for tree construction and several methods are used. There are several phylogenetic tree construction methods in existence. Some of those are UPGMA, Minimum Evolution (ME), Neighbor Joining (NJ), Maximum Parsimony (MP) and Maximum Likelihood (ML).

The way in which multiple sequence alignment is converted into numerical data is the main distinction between the different tree building methods. Numerical data is analyzed mathematically in order to construct a tree. Some of those methods are,

Distance Matrix Based Methods


Neighbor Joining

Minimum Evolution

Maximum Parsimony Method

Maximum Likelihood Method

Tree Merging

Consensus trees

Super Trees Distance Matrix Methods

This method calculates a measure of distance between each pair of sequences, and then finds out the tree that predicts the observed set of distances as closely as possible [10]. These methods reduce the data matrix to a simple table of pair-wise distances. These methods leave out all the information from higher order combinations of character states. It does not use any statistical model of evolution. In distance matrix methods branch lengths do not simply represent the evolutionary time but the expected amount of evolution [10]. Two branches of a tree may reflect the same elapse time but with different amount of expected evolution [10]. There are some methods available in distance matrix method such as,

Minimum Evolution Method (ME)

This method uses the standard, the total branch length of reconstructed tree [10]. The tree fits with data in this method [10]. The branch lengths are determined using the unweighted least square method [10]. The least square trees are determined for different topologies and choose the one with shortest total length among them [10]. Because of branch lengths of all topologies must be evaluated, the construction of minimum evolution tree is time consuming. The main drawback in this method is the total length of the true tree will not be the shortest all the time [10].


UPGMA stands for Unweighted Pair Group Method with Arithmetic mean. It is a hierarchical clustering method [10]. UPGMA method assumes a constant rate of evolution among the different lineages [10]. UPGMA assumes equal rates of mutation among all the branches. The algorithm examines the structure present in a pair wise distance matrix to them to construct a rooted tree [10]. At each step the nearest clusters are combined into a higher cluster. The main drawback of UPGMA clustering method is very sensitive to unequal evolutionary rates [10].

Neighbor Joining Method

This method uses a bottom up clustering method. This is widely used distance based method. It does not assume all lineages evolve at the same rate and instead approximates the minimum evolution criteria [10]. It is a greedy algorithm that constructs the tree in stepwise fashion. At each step of the algorithm the tree topology that gives the least total branch length is preferred [10]. The neighbor-joining method is computationally efficient with relative to the other methods [10].

A drawback of this method is it will not be able to find the true tree topology with minimum total branch lengths. Neighbor joining method is a simple way of creating trees as the information content of the multiple alignments is reduced to its simplest form. Unfortunately as a result of this, information is lost, in particular those pertaining to ancestral identities at each position in the multiple alignments [10]. Maximum Parsimony Method

Maximum Parsimony method assumed that evolution follows the shortest possible path and the correct phylogenetic tree is therefore the one that requires the minimum number of nucleotide substitutions to produce the observed differences between the sequences [10]. Trees are therefore constructed at random and the number of nucleotide substitutions that they involve calculated until all topologies have been examined and the one requiring the smallest number of steps identified [10].

More sequences added means more trees need to be generated. For example, with five sequences only 15 possible unrooted trees generated but with 10 sequences, 2,027,025 uprooted trees and with 50 sequences the number exceeds the number of atoms in this universe [10]. Not even super computers can evaluate all the trees with Maximum Parsimony method [10]. This is also true for the sophisticated methods such as Maximum Likelihood method. Maximum Likelihood Method

Maximum likelihood method use standard statistical methods to make the estimate of phylogeny with models of evolution for a character [10]. The probability that the proposed model and the hypothesized history would give rise to the observed data set was used to evaluate the hypothesis about evolutionary history [10]. The assumption is that a history with a higher probability of reaching the observed state is preferred to a history with a lower probability [10].

The method searches for the tree with the highest probability or likelihood [10]. Maximum likelihood Method first picks an evolutionary model using a model test [10]. Then it will generate all possible tree structures for each position [10]. Based on the evolutionary model, it will calculate likelihood of these trees and sum them to get the column likelihood for each OTU cluster [10]. Then it will calculate tree likelihood by multiplying the likelihood for each position [10]. Finally it will choose tree with greatest likelihood [10]. Maximum likelihood method is very flexible [10]. It gives good results under good evolutionary models and bad approximations under bad evolutionary model. Widely used phylogeny programs and packages


PHYLIP stands for PHYLogeny Inference Package. PHYLIP is the most widely used package of programs for generating evolutionary trees. It is available freely over Internet also. PHYLIP already compiled executables are available for Windows, Mac OS X and Linux [11].

PHYLIP can handle DNA/RNA sequence data as well as protein sequence data. The program reads the input data from a text file in ASCII or Text Only format. ClustalW, ClustalX alignment programs can write data files in the PHYLIP format. Output is written into a special file named outfile and outtree. Trees are written onto the out tree in Newick format [11].


BioEdit is a biological sequence editor that runs in Windows 95/98/NT/2000 and is intended to provide basic functions for protein and nucleic sequence editing, alignment, manipulation and analysis. BioEdit is not intended to be a powerful sequence analysis program, but is instead intended to be a user friendly interface for sequence manipulation and linking to other, more scientific analysis programs, both local and WWW-based. It is now suitable for large alignments

(>2000 sequences) [18].

Available at:


MEGA stands for Molecular Evolutionary Genetics Analysis [12]. This software tool can be used to infer evolutionary relationships of homologous sequences and to estimate neutral and selective evolutionary variance among sequences [12]. It has employed statistical methods in analyzing sequences [12]. It can used to analyze larger number of sequences (>10000) even with millions of base pairs [12]. MEGA, an integrated tool also supports for sequence alignment [12]. Also DOS, Windows and Linux versions are available.

The number of possible phylogenetic trees grows exponentially when the number of data set grows [14]. To reduce the time spend on non optimal tree, most of the phylogenetic inference methods use clustering approaches (Neighbor Joining Method) or heuristic search strategies structures out from all possible tree structures [14]. The space of topologies can not search exhaustively [13]. Heuristic search also slow in this type of scenario [13] [14]. In this type of serious computation maximum likelihood method is used [13] [14].

2.2.3 Names of DNA and protein sequences which mostly use for phylogenetic analysis

To choose the appropriate sequence to generate the phylogenetic tree, we have to consider some criteria such as sequence should represent the wide range of organisms and it should have more divergence or not. Following represent the sequences which are widely used for phylogenetic analysis.

16s ribosome RNA gene

12s ribosome RNA gene

18s ribosome RNA gene

Microcondial RNA gene

Transfering sequences


2.3 Neural network usage for the phylogenetic analysis

Although there are more methods available to phylogenetic analysis they all have some drawbacks so there is a trend to use neural networks as replacement of phylogenetic analysis. Because of some advantages other than phylogenetic analysis, researches use several types of neural networks to generate the phylogenetic trees. Neural networks are used in several areas in sequence analysis such as DNA introdexon discrimination and gene identification, DNA and protein pattern analysis, protein secondary tertiary structures prediction, gene classification [21], protein family classification [15], and phylogenetic analysis [15].

In the machine learning area, Artificial Neural networks (ANNs) belong to the adaptive class of techniques. ANNS can be used as a solution to various problems. Neural networks have intelligent pattern recognition methodology for inspired as the biological neural networks [17]. Most models of ANNs consist with a number of processing units called artificial neurons and also a number of weighted connections (artificial synapses) between the neurons [17]. The process of building an ANN, involves a learning episode called as "training" which is similar to its biological inspiration. During training episode the network observes a sequence of recorded data, and adjusts the strength of its synapses according to a learning algorithm and observed data [17]. "Learning" is the process of adjusting the synaptic strengths in order to be able to fulfill a certain task. Learning algorithms can be generally divided into two types which are supervised and unsupervised.

Supervised learning algorithm

In supervise learning algorithms; they require more a priori knowledge about the training set. It's most attractive feature is their capability of learning from example. We should know the outputs of neural network before training in these networks. A supervised network is trained by a data set of predefined organization scheme and used to classify new sequences into the data set [15].

Unsupervised learning algorithm

Unsupervised learning algorithms does not require a priori knowledge about the training set so it has advantages in pattern analysis of diversified data sets [15]. It defines its own organization scheme according to input data set. So those type of neural network is used in various problems such as detect signal peptide coding region and potentially functional regions of nucleic acids (Amgo et al., 1991; Giuliano et al.,1993), protein sequence classification (Ferran & Ferrara, 1991;Ferran et al., 1994; Andrade et al., 1997), and protein pattern recognition (Hanke et al., 1996; Hanke & Reich, 1996)[15]. Advantages of using neural network approach for phylogenetic analysis.

Neural networks (NNs) have several unique features and advantages over conventional statistical methods. Those are,

They incorporate both positive and negative information and also they are able to detect second and higher-order correlation in patterns and a pre aligned model is not required [15].So this feature of neural networks is more suitable for molecular analysis.

Although the data set is noisy, neural network give more robust result than other methods [16]

Provide more accurate result than other hierarchical clustering methods.

Neural networks are faster than other methods. Design Steps of neural network

The designing of a neural network consists of several steps such as,

Collect data.

Create the network.

Configure the network.

Initialize the weights and biases.

Train the network.

Validate the network.

Use the network. Preparation of the data sequence as neural network input.

There are some pre processing tasks that have to be done to create input using sequence data for the neural network. It uses the sequence encoding schema to convert molecular sequences (Character strings) into input vectors (numbers) of the neural network classifier. Similar sequences are represented by 'close' vectors in this. There are two different approaches for the sequence encoding [21].

Use sequence data directly

As in most neural network applications of molecular sequence analysis it can use sequence data directly. To do that an indicator vector is used to represent each molecular residue in the sequence string. If sequence is amino acid, use a vector of 20 input units of which 19 have a value of zero, and one has a value of one. If sequence is a nucleotide use a vector of four units of which three are zeroes and one is one. If we have long and varied-length sequences to be compared, this representation, however, is not suitable for sequence classification.

N-gram Method

This method extracts and counts the occurrences of patterns (terms) of n consecutive residues (i.e., a sliding window of size n) from a sequence string [21]. Along the sequence, it uses count but not positions, of the n-gram terms. Therefore, this method provides certain insertion/deletion invariance, length-invariant, and does not require the laborious sequence alignments of many other database search methods. The counts of the n-gram terms from each encoding method are scaled to fall between 0 and 1 and it used as input vectors for the neural network, with each unit of the vector representing an n-gram term [21]. If the size of the alphabet is m then size of the input vector for each n-gram extraction is mn. The alphabet sets used for protein sequences include the 20-letter amino acids and nucleic acid sequences is the four - letter representation of AT(U)GC. Size of the input vector tends to be large is the main drawback of the n-gram method so this indicates that the size of the weight also be large [21]. Because the weight matrix size equals to w, where w = input size x hidden size + hidden size x output size. There for it prohibits the use of larger n-gram sizes.

SVD method

It can be use Singular Value Decomposition method in neural networks to avoid data over fitting and to provide better generalization capability with minimal architecture accepted by fewer neurons and interconnections. It reduces the size (i.e., the number of dimensions) of the n-gram vectors and to extract semantics from the n-gram patterns [21]. In this method, n-gram term matrix (i.e., term-by-sequence matrix) is decomposed into a set of k orthogonal factors from which the original matrix can be approximated by linear combination [21].

2.3.2 Types of neural networks used in phylogenetic analysis

There is a very large body of research that has resulted in a large number of ANN designs in biological sequencing and phylogenetic analysis [17] .Some of those neural networks is,

Layered feed forward back propagation neural networks (supervised neural network).

Probabilistic neural networks (supervised neural network).

Self organizing Map (unsupervised neural network).

Self organizing tree algorithms (unsupervised neural network). Layered feed forward back propagation neural networks

These types of neural networks are supervised learning networks which can be used in biological sequencing and pattern recognition problems. Neurons are organized in layers in these networks. The layers are normally fully connected because of each element (neuron) of a layer is connected to each element of the next layer. Because of this network starts either with a minimal number of synaptic connections between the layers and adds to the number as training progresses (constructive), or starts as a fully connected network and prunes connections based on the data observed in training (destructive) it has self organizing behaviors also [17].

Learning algorithm of these networks is back propagation. When the back propagation learning algorithm and the feed-forward are combined, this layered networks become more popular artificial neural networks. Because of algorithm of these networks is very simple, this is the first network which tried in a new problem. These ANNS can be applied to virtually all pattern recognition problems [17]. In sequencing also many researchers have used this type of network as a first neural network. Wu has developed a system called gene classification artificial neural system (GenCANS), which is based on a three layered, feed-forward back propagation network. This GenCANS was designed to "classify new (unknown) sequences into predefined (known) classes [18]. This network has two steps which are sequence encoding and neural network classification and another step is map molecular sequences (input) into gene families (output). So it can be use layered feed forward back propagation neural networks as supervise learning neural network to understand the category of unknown data sequences either DNA or protein sequences in phylogenetic analysis. Probabilistic neural networks

These types of neural networks are the supervised neural network which can be used as biological sequences classification. The algorithm used in the Probabilistic Neural Network (PNN) is statistical algorithm called Kernel discriminate analysis. It is also a multilayered feed forward network with four layers. Those layers are input layer, pattern layer, summation layer and output layer. Because of it can map any input pattern to a number of classifications, it can be said PNN as a predominantly a classifier. The main advantages of these PNN are fast training process, an inherently parallel structure,

if the size of the representative training set increases it can be guaranteed to converge to an optimal classifier and training samples can be added or removed without extensive retraining any input pattern to a number of classifications [20]. This network can learn more quickly than other neural networks and it have successes in many variety of application. So this type of network out performs the other neural networks in most cases. Because of those advantages PNN can be viewed as supervise learning networks which can be use in system classification and pattern recognition problems [20].

This PNN has been used in protein superfamily classification problems which consists of determining the super family membership of a given unknown protein sequence. This finding is very important for a biologist for many practical reasons, such as drug discovery, prediction of molecular function and medical diagnosis [20].

Architecture of PNNs

The PNN consists of nodes allocated in three layers after the inputs:

Figure 5

Architecture of PNN Self organizing Neural Networks

These networks are a very large class of neural networks. Its structure changes during training, based on the observed data. These networks consist with number of neurons, number of synaptic connections, number of modules, or number of layers. There are two classes of this type of networks. Those classes are destructive and constructive [17]. There is a fully connected topology and the learning algorithm prunes synapses based on the observed data in destructive networks. In constructive algorithms start with a minimally connected network and in the training time it gradually adds synapses (neurons, modules, or layers) to the network [17]. These types of neural networks can be used in biological sequencing and expression level data analyzing. There are some self organizing neural network types such as,

Self-Organizing Map:

This is unsupervised neural network approach which is first proposed by Kohonen . A self-organizing map (SOM) is a type of neural network approach first proposed by Kohonen. These networks can be used as a divisive clustering approach in many areas including genomics [17].Tamayo and colleagues [1324n] use self-organizing maps to explore patterns of gene expression generated using Affymetrix arrays, and provide the GENECLUSTER implementation of SOMs. In this approach SOM assigns genes to a series of partitions. It is done through the similarity of their expression vectors to reference vectors that are defined for each partition. The process of these neural networks can be defined in some steps. As a first step random vectors are constructed and assigned to each partition and second a gene is picked at random and, using a selected distance metric, the reference vector that is closest to the gene is identified. As the third step the reference vector is then adjusted so that it is more similar to the vector of the assigned gene. As fourth step do the second and third step in several thousand times. Have to do these steps for decreasing the amount by which the reference vectors are adjusted and increasing the stringency used to define closeness in each step [17]. When the process continues the reference vectors converge to fixed values. In last genes are mapped to the relevant partitions depending on the reference vector to which they are most similar [17].

There are some advantages in this SOM networks such as they allow a priori knowledge to be included in the clustering process, it has a number of features that make them particularly very suitable to clustering and analyzing DNA and protein sequences and gene expression patterns and it has easy visualization and interpretation facilities, it has good computational properties and are easy to implement, reasonably fast, and are scalable to large data sets [17]. Also there are some disadvantages in SOMs. Most popular one is it is difficult to know when to stop the algorithm. So user need some other source of information like number of cluster that best represent the available data [17]. Also it has some undesirable properties like pre determine topology, lack of higher order co relation in patterns, tendency for common pattern dominate the network [16].

Self organizing tree algorithms:

These self organizing trees are the type of constructive neural network method. It develops a familiar binary tree during training. It grows binary tree topology using hierarchical neural networks. Also SOTA based on Kohonen's SOM. So it can be said as SOTA is a combination of hierarchical clustering and self organizing maps.

In phylogenetic reconstruction, SOTA is first design to pre-aligned sequences in genes to classify protein sequences [17]. But now it has adapted to analyze patterns associated to the frequency of residues along a sequence also. Although some of the neighborhood information in a sequence may be lost in this approach, they can compare sequences of different lengths without having to align them [15].

The output of SOTA algorithm is the distributing binary tree that adapt to the input data sets. SOTA has been used in the gene expression analysis. There are some steps to implement it, first have to provide each gene expression profiles to each terminal node in multiple times in the network. After each presentation, more closely matched cell updates according to the more closely matched with input profile. At the end of each cycle, the cell generates two daughter cells which have the higher resources. That profile of each cell is represented as ancestor of those daughters. This process continues until each cell has a single profile or the desired level of heterogeneity has been met [16]. Distance between average profiles of the nodes are represented by branch length of the final dendrogram.

It has some advantages other than classical hierarchical clustering method such as even with noisy data sets it is more robust , it reflect natural relationship among data, can be used as alternate to phylogenetic analysis [16], time complexity is very low, it has top-to-bottom hierarchical approach, runtimes are approximately linear with the number of items to be classified, also it is suitable for large datasets, it can stop at any level of hierarchy and produce meaning full intermediate results because of SOTA build higher clusters in the hierarchy before forming the lower clusters [17]. So the self organizing trees can be use as alternative neural network approach to phylogenetic analysis. Other neural network techniques

Molecular computers and molecular neural networks:

The DNA neural networks are the most recent and innovative artificial neural network solution which is suggested for data expression profiling and sequencing. It has massive parallel nature. So it offers exceedingly fast computation [17]. The method presents high potential in various aspects of the solution.

Bayesian Neural Networks:

There are a number of recent networks that have been suggested as solutions for sequencing and expression data analysis. One of those types of network is Bayesian neural networks (BNNs). It can be used for gene expression analysis. In this type of networks use structural learning for exploring microarray data in gene expressions with BNNs [17]. This type of neural network can be said as important ANN solutions that have been offered to the problem at hand.