Abstract-Visualization plays a vital role in molecular biology, where biologists are interested in viewing the gene patterns of RNA and DNA. Experiments that detect gene patterns with microarrays have produced large datasets of gene expression. Visualizing these datasets is our main goal. Although many tools are available for visualizing these datasets, it may be difficult for a biologist to choose an appropriate one. A biologist needs different views of these datasets, but it is difficult for a single tool to provide every possible insight into the gene expression they contain. Biologists prefer a tool that provides the required information simply and in the least possible time. This survey studies some of the visualization tools that can help biologists visualize the datasets obtained from microarray experiments.
1 INTRODUCTION
Microarrays are a high-throughput technology used to measure the expression levels of genes. They are a popular gene expression profiling tool for analyzing many genes in a single experiment. Microarray technology is advancing, with numerous applications in areas such as gene expression, genotyping, and cell biology. It allows researchers to analyze genes and compare their behavior, and it helps researchers learn more about diseases such as heart disease, infectious disease, mental illness and many more. It also helps them find the interactions and relations among genes. Before the development of microarray technology, scientists classified diseases based on the organ in which a tumor develops. With microarrays, scientists are able to classify diseases based on the gene activity in the tumor cells. The major steps involved in a microarray experiment are:
_ Preparation of microarrays.
_ Preparation of fluorescent labeled cDNA probes and hybridization.
_ Scanning the microarray, followed by image and data analysis (Figure 1).
A microarray consists of sequences of deoxyribonucleic acid (DNA), proteins, or tissues arranged in the form of an array for easy understanding.
Microarrays are mainly used for:
_ Gene expression studies
_ Disease diagnosis
_ Pharmacogenetics (drug discovery)
_ Toxicogenomics (study of how the genome responds to toxic substances)
The central principle of the microarray technique is the selective binding of complementary single-stranded nucleic acid sequences (hybridization) and the use of fluorescent probes (Cy3 and Cy5) to visualize the difference in complementary DNA (cDNA) levels, which represent mRNA levels. Microarrays are created by robotic machines that place the genes on a glass slide or a silicon chip. The process of building a microarray chip is shown in Figure 1.
Fig. 1. Process of building a Microarray Chip.
Manuscript received 31 March 2010; accepted 1 August 2010; posted online 24 October 2010; mailed on 16 October 2010.
Microarrays can be broadly classified into two types:
1. cDNA array (Spotted Array)
2. High-density Oligonucleotide array (Affymetrix)
The two types of microarrays are shown in Figure 2.
Fig. 2. a) Affymetrix Chip b) Spotted Array Chip.
In an Affymetrix chip, each gene has around 16-20 pairs of probes synthesized on the chip. Each pair of probes has two oligonucleotides:
_ Perfect match (PM, reference sequence) ATG...C...TGC (around
_ Mis-match (MM, one base change) ATG...T...TGC
The major difference between the two types of microarray is that a cDNA (spotted) array uses two different dyes, Cy3 and Cy5, for labeling the RNA, whereas an Affymetrix array uses only one type of dye.
If the RNA from the healthy sample is in abundance, the spot will be green; if the RNA from the tumor sample is in abundance, it will be red. If both are equally abundant, the spot will be yellow; if neither is present, the spot will be black. In a prepared microarray slide, each spot contains multiple identical strands of DNA, and the DNA sequence in each spot is unique. The microarray is then scanned to extract the gene information. The scanning is done twice, once to detect the Cy5 fluorescence and once to detect the Cy3 fluorescence. This process is shown in Figure 3.
Fig. 3. Scanning process of Microarray 
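The color rule described above can be sketched in a few lines. This is only an illustration: the function name `spot_color`, the detection floor and the tolerance are assumptions made for the example and do not come from any scanner or analysis package.

```python
def spot_color(healthy, tumor, floor=50.0, tolerance=0.2):
    """Classify a spot from the healthy-sample and tumor-sample intensities.

    `floor` and `tolerance` are hypothetical thresholds chosen for illustration.
    """
    # Neither sample hybridized: both channels are below the detection floor.
    if healthy < floor and tumor < floor:
        return "black"
    # Relative difference between the two channels, in the range [-1, 1].
    balance = (tumor - healthy) / (tumor + healthy)
    if abs(balance) <= tolerance:
        return "yellow"  # both samples roughly equally abundant
    return "red" if balance > 0 else "green"
```

In practice the thresholds would have to be chosen for the scanner and dyes in use.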
The raw data obtained from a microarray slide are digital images. To obtain information about the gene expression levels, these images are analyzed: each spot on the array is identified, and its intensity is measured and compared with the values representing the background. Image quantitation is usually carried out using image analysis software. The three phases of processing scanned images are:
1. Addressing or Gridding
_ Assigning coordinates to each of the spots.
2. Segmentation
_ Classification of pixels either as foreground or as background.
3. Intensity extraction (for each spot)
_ Foreground fluorescence intensity pairs (R, G).
_ Background intensities.
_ Quality measures.
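As a minimal sketch of the intensity extraction phase, the function below takes the pixel values of one spot in one channel together with a foreground mask produced by segmentation. The name `extract_spot`, the use of medians and the subtraction of the background are assumptions made for illustration; image analysis packages differ in these choices.

```python
from statistics import median

def extract_spot(pixels, foreground_mask):
    """Summarize one spot in one channel.

    pixels: pixel intensities in the spot's grid cell.
    foreground_mask: True where segmentation marked the pixel as foreground.
    """
    fg = [p for p, is_fg in zip(pixels, foreground_mask) if is_fg]
    bg = [p for p, is_fg in zip(pixels, foreground_mask) if not is_fg]
    fg_median, bg_median = median(fg), median(bg)
    # Assumed correction: subtract the local background, never going below zero.
    signal = max(fg_median - bg_median, 0)
    return {"foreground": fg_median, "background": bg_median, "signal": signal}
```

Running the function once per channel yields the (R, G) foreground pair and the background intensities listed above.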
During hybridization, some of the probe mRNA will attach to the array even when there is no cDNA available. This is known as background intensity. An essential feature of all image analysis software is that the digital microarray images are processed and the data are extracted and combined in a table, known as a spot quantitation matrix. Each row corresponds to one spot on the array, and each column represents a different quantitation characteristic of the spot, such as the median or mean pixel intensity of the spot and of the local background.
The data from multiple hybridizations must be further transformed and organized in a gene expression matrix, as shown in Figure 4. In this matrix, each row represents a gene and each column represents an experimental condition, such as a particular biological sample. Each position in the matrix characterizes the expression level of a particular gene under a particular experimental condition. Building such matrices helps us understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if over-expression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles. Obtaining a gene expression matrix that combines information from many spot quantitation matrices is not easy, because a single gene can be represented by several features on the array, containing the same or different DNA sequences. Also, the same experiment may be measured in multiple hybridizations carried out as replicate experiments.
Fig. 4. Gene Quantization, Gene Expression Matrix 
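One simple way to combine several features and replicate hybridizations into a single matrix entry is to average them. The sketch below assumes this averaging rule and a hypothetical record layout of (gene, condition, value); neither is prescribed by the text above.

```python
from collections import defaultdict
from statistics import mean

def build_expression_matrix(spots):
    """Collapse (gene, condition, value) records into {gene: {condition: mean value}}."""
    grouped = defaultdict(list)
    for gene, condition, value in spots:
        grouped[(gene, condition)].append(value)
    matrix = defaultdict(dict)
    for (gene, condition), values in grouped.items():
        # Assumed rule: replicate spots for the same gene and condition are averaged.
        matrix[gene][condition] = mean(values)
    return {gene: dict(row) for gene, row in matrix.items()}
```

Each outer key is a row (gene) and each inner key a column (experimental condition) of the gene expression matrix.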
Gene expression data have proven to be highly informative of disease state, particularly in oncology (the study of tumors), where accurate and early diagnosis can be critical. These data are carefully recorded and stored in databases, where they can be queried, compared and analyzed using different software programs. A gene expression database mainly consists of:
_ The gene expression matrix,
_ Sample Annotation and
_ Gene Annotations.
Microarray data are meaningful only in the context of the particular biological sample and the exact conditions under which the sample was taken; this is addressed by the sample annotation. For example, if we are interested in finding out how different genes react to treatments with various chemical compounds, the compounds used in the experiments should be specified in the sample annotation. Gene annotation may include the gene name, sequence information, location in the genome, a functional role for a known gene, etc. A matrix of this type is also called an annotated gene expression data matrix. A sample annotated gene expression matrix is shown in Figure 5.
There is no standard way of measuring gene expression levels. Hence every experiment conducted should be stored in the database with the details of how the gene expression data matrix was obtained. The Microarray Gene Expression Data (MGED) Society provides guidelines for Minimum Information About a Microarray Experiment (MIAME), which attempt to define a set of information sufficient to interpret the experiment and the results obtained. According to MIAME, a gene expression experiment description may include the following:
_ Experimental design - experiment type, authors, experimental
factors and citations;
_ Samples used (initial characteristics and treatment history), extract
preparation and labelling, including laboratory protocols;
_ Hybridization procedures and parameters; and
_ Measurement data and specifications of data processing - raw
data, image properties, normalized and summarized data.
Fig. 5. Conceptual view of Gene Expression matrix.
As shown in Figure 6, the gene annotation refers to the features of a particular gene, and the sample annotation refers to the conditions under which that gene was tested. The remaining values specify the expression level of the gene. The column headers and row headers of the gene annotation and sample annotation differ from experiment to experiment; for different types of experiment we could select different conditions and different gene features.
Gene expression data analysis can be classified as either:
_ Supervised or
_ Unsupervised.
In supervised analysis, we use both the gene annotation and the sample annotation from the beginning of the analysis. An example of supervised analysis is sample classification, in which we use the sample annotation to split the set of samples into different classes, for example healthy and diseased tissues, and try to find features in the expression data that discriminate between these classes.
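A very simple form of this search is to rank genes by how far apart their average expression is in the two classes. The sketch below assumes the class labels "healthy" and "diseased" and uses the absolute difference of class means as the score; real studies typically use a proper statistical test, so this is an illustration only.

```python
from statistics import mean

def rank_discriminating_genes(matrix, labels):
    """Rank genes by |mean(diseased) - mean(healthy)|, largest difference first.

    matrix: {gene: [expression value per sample]}
    labels: class label per sample, taken from the sample annotation.
    """
    scores = {}
    for gene, row in matrix.items():
        healthy = [v for v, lab in zip(row, labels) if lab == "healthy"]
        diseased = [v for v, lab in zip(row, labels) if lab == "diseased"]
        scores[gene] = abs(mean(diseased) - mean(healthy))
    return sorted(scores, key=scores.get, reverse=True)
```

The genes at the top of the returned list are candidate features for separating the two classes.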
In unsupervised analysis, we do not consider the annotations while classifying the data. Examples of such analysis are gene clustering (finding sets of genes with similar expression patterns) and sample clustering (finding samples that are similar in terms of similarly expressed genes). The annotation is taken into consideration afterwards, to check whether a cluster of similarly expressed genes contains genes with similar functional roles.
Visualization is a powerful data mining technique for finding patterns in data, and it is used for gene expression data analysis. Since gene expression matrices are high dimensional, visualization can be used in combination with techniques that reduce the dimensionality, such as clustering or principal component analysis. Visualization helps biologists gain greater insight into the data. The dataset produced has to be normalized so that noise (uninteresting detail) can be eliminated. Normalization is required to correct measurement errors and bias in the observed data. The error and bias may be introduced during the hybridization process, by noise in the scanners, or by environmental conditions.
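The text does not name a normalization method. As one common and simple possibility, the sketch below converts the two channel intensities of each spot to a log2 ratio and then centers the ratios of the array on their median. The function name and the choice of global median centering are assumptions; all intensities are assumed to be positive.

```python
from math import log2
from statistics import median

def normalize_log_ratios(red, green):
    """Return median-centered log2(red / green) ratios for one array."""
    ratios = [log2(r / g) for r, g in zip(red, green)]
    # Assumed correction: shift the array so that its median log ratio is zero.
    center = median(ratios)
    return [m - center for m in ratios]
```

After centering, a systematic bias toward one dye no longer shifts every gene on the array in the same direction.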
There are many visualization tools available whose features overlap, and hence biologists should select a particular tool that helps them gain these insights. We discuss some of the most popular visualization tools, such as Spotfire, GeneSpring, TimeSearcher and ClusterView. Using these tools, biologists can visualize the data according to their needs. We try to understand which tools are easy to use and provide the expected results. Each visualization tool has its own advantages and disadvantages. A measure of an effective visualization can also be its ability to generate unpredicted new insights, beyond predefined data analysis tasks. These tools must help biologists find the answers they are searching for and must also help them investigate further.
Fig. 6. Sample Data Set
2 VISUALIZATION METHODS
Visualization methods are primarily used to gain biological insight from the data generated by a microarray. As shown in Figure 6, each row represents a particular gene and each column represents an experiment conducted on that gene. With this information, it is easier for the biologist to answer questions such as:
_ What genes have a similar profile?
_ What are the features of genes with similar profiles?
_ What is the functional behavior of a particular gene?
_ What genes are involved in a particular biological process?
_ What genes are the key elements in a biological process?
Biologists use different visualization methods to answer these questions. For example, if a biologist is interested in the features of a particular gene, then a heatmap is a method that can provide this information.
The dataset obtained after image processing is very large, and it is difficult to interpret what each value in it specifies. Different visualization methods are used to understand the dataset and to gain insight into what these values mean. There are various approaches to visualizing microarray data, ranging from viewing the raw image data to viewing profiles of genes across experiments. In this section we illustrate common visualization methods used to visualize microarray data.
Fig. 7. Various ways of representing the data
2.1 Heatmap
Heatmaps are the most popular method used to visualize microarray data. A heatmap is a type of plot in which the pivoted (short/wide) data are presented as a matrix of rows and columns, where the cells are of equal size and the information represented by the color of the cells is the most important property. A heatmap represents genes across the rows and experiments across the columns. Each cell of the matrix is colored according to its value, which differentiates the cells from each other. A sample heatmap is shown in Figure 8.
Heatmaps are usually used when biologists need to find the behavior of a particular gene across different experiments or conditions. Heatmaps help find clusters of genes, displayed as areas of similar color, that behave similarly across a set of experiments.
Fig. 8. Example of Heatmap.
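The mapping from a value to a cell color can be sketched as follows. The green-black-red scale and the default value range are assumptions chosen for illustration; the tools discussed later may use different color schemes.

```python
def heat_color(value, low=-2.0, high=2.0):
    """Map an expression value to an (R, G, B) tuple on a green-black-red scale."""
    # Rescale to [-1, 1] around the midpoint of the range, clamping outliers.
    mid = (low + high) / 2
    half = (high - low) / 2
    t = max(-1.0, min(1.0, (value - mid) / half))
    if t >= 0:
        return (round(255 * t), 0, 0)   # above the midpoint: shades of red
    return (0, round(255 * -t), 0)      # below the midpoint: shades of green

def heatmap_colors(matrix):
    """Color every cell of a genes-by-experiments matrix."""
    return [[heat_color(v) for v in row] for row in matrix]
```

Genes that behave alike across experiments produce rows with similar color patterns, which is what makes clusters visible.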
2.2 Treemap
Treemaps display data as a set of nested rectangles. Each branch of the tree is represented as a rectangle, which is further tiled with smaller rectangles to represent the sub-branches. This leads to a collection of embedded rectangular bounding boxes, which readily shows the hierarchical structure of the information space. The parent-child relationship is indicated by enclosing each child rectangle within the corresponding parent rectangle. The advantage of treemaps is that they make efficient use of space; as a result, they can display thousands of items on the screen simultaneously. The disadvantage of treemaps is that the lack of edges linking the nodes might prevent us from understanding the hierarchical structure of the dataset. The computational overhead is also higher than that of classical tree drawing algorithms. An example treemap is shown in Figure 9.
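The nesting of child rectangles inside their parent can be sketched with a simple layout that alternates the split direction at each level (often called slice-and-dice). The text does not say which layout algorithm is used, so this choice, along with the dictionary keys `name`, `size` and `children`, is an assumption made for illustration.

```python
def weight(node):
    """Total size of a node: its own size for a leaf, else the sum over its children."""
    children = node.get("children")
    return sum(weight(c) for c in children) if children else node["size"]

def layout(node, x, y, w, h, depth=0, out=None):
    """Return {leaf name: (x, y, width, height)} for a tree placed in the given rectangle."""
    out = {} if out is None else out
    children = node.get("children")
    if not children:
        out[node["name"]] = (x, y, w, h)
        return out
    total = weight(node)
    offset = 0.0
    for child in children:
        share = weight(child) / total
        if depth % 2 == 0:   # even depth: children sit side by side
            layout(child, x + offset * w, y, w * share, h, depth + 1, out)
        else:                # odd depth: children are stacked
            layout(child, x, y + offset * h, w, h * share, depth + 1, out)
        offset += share
    return out
```

Each child rectangle lies entirely inside its parent, and the leaf areas are proportional to their sizes.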
Fig. 9. Tree map.
2.3 Parallel Coordinates
Parallel coordinates is a multidimensional visualization method that helps to represent, explore, evaluate and analyze a large volume of data. Parallel coordinates display many dimensions at once, and therefore users can easily identify relationships between the genes and the experiments conducted on them. A sample parallel coordinates visualization is shown in Figure 10.
Fig. 10. Parallel Coordinates to represent microarray data
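As a sketch of how such a plot is built, the function below turns each gene profile into a polyline with one vertex per experiment axis, scaling every axis to the range 0 to 1. The function name and the per-axis min-max scaling are assumptions for illustration; drawing the lines is left to a plotting library.

```python
def parallel_coordinates(profiles):
    """Turn {gene: [value per experiment]} into {gene: [(axis index, scaled height)]}."""
    rows = list(profiles.values())
    n_axes = len(rows[0])
    lows = [min(r[i] for r in rows) for i in range(n_axes)]
    highs = [max(r[i] for r in rows) for i in range(n_axes)]
    lines = {}
    for gene, values in profiles.items():
        lines[gene] = [
            # An axis on which all genes are equal is drawn at mid-height.
            (i, 0.5 if highs[i] == lows[i] else (v - lows[i]) / (highs[i] - lows[i]))
            for i, v in enumerate(values)
        ]
    return lines
```

Genes with similar expression profiles yield polylines that run close together across the axes.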
2.4 Principal Components Analysis
Principal Components Analysis (PCA) is an exploratory multivariate statistical technique for simplifying complex datasets. When there are 'm' genes and 'n' experiments, the goal of PCA is to reduce the dimensionality of the data matrix by finding 'r' new variables (the principal components), where 'r' is less than 'm'. PCA also supports data analysis, where it can be applied for data preprocessing before clustering. PCA can be used on a time series dataset, where the behavior of the genes can be observed. PCA tries to reduce the dimensions of the data so as to summarize the most important part while ignoring the noise. A sample PCA is shown in Figure 11. The major disadvantage of PCA is that it cannot capture nonlinear structures consisting of arbitrary clusters.
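To make the idea of a "new variable" concrete, the sketch below finds only the first principal component, using power iteration on the sample covariance matrix. This is a small illustrative implementation that assumes the data are not all identical; it is not the method of any particular tool, and numerical libraries normally use an eigenvalue or singular value decomposition instead.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) of the first principal component.

    rows: list of observations, each a list of d numeric values.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)] for a in range(d)]
    # Power iteration: repeated multiplication typically converges to the
    # direction of largest variance.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

The scores are the coordinates of the observations along the new variable and can be plotted or clustered in place of the original values.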
Fig. 11. Principal Component Analysis
2.5 Dendrogram
A dendrogram is a visual representation of the spot correlation data. The individual spots are arranged along the bottom of the dendrogram and are called leaf nodes. Spot clusters are formed by combining individual spots, and each join point is called a node. Dendrograms are used extensively for clustering; clustering describes how a given dataset can be divided or grouped into smaller, related datasets in order to extract specific information. Dendrograms are used to visualize the nested sequence of clusters resulting from hierarchical clustering. The main advantage of dendrograms is their ease of interpretation. A sample dendrogram is shown in Figure 12.
Fig. 12. Sample Dendrogram
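The nested sequence of clusters that a dendrogram draws can be produced by agglomerative clustering. The sketch below assumes Euclidean distance and single linkage, neither of which is specified in the text, and returns the merges in the order in which the join points would appear from the bottom of the dendrogram upwards.

```python
from math import dist

def hierarchical_merges(profiles):
    """Single-linkage agglomerative clustering; returns [(left, right, distance), ...]."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(dist(profiles[p], profiles[q]) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # The merged cluster is named after the two clusters it joins.
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges
```

Each returned tuple corresponds to one node of the dendrogram, and its distance gives the height at which the two branches join.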
2.6 Scatter Plot
Scatter plots show the relationship between two variables of the data points in a two-dimensional graph. Scatter plots are mainly used to map similarities between genes and to help biologists find clusters, outliers and correlations in the data. Scatter plots are useful when there are a large number of data points. When working with a single dataset, it is possible to look at the expression behavior of a particular gene. They show properties of the relationship between two variables, such as its direction (positive or negative) and its strength. An advantage of scatter plots is that they retain the exact data values and the sample size. They also provide details about outliers. The disadvantage of scatter plots is that results are difficult to visualize for large datasets, i.e., data involving hundreds to thousands of time points cannot be analyzed with scatter plots. One more problem with scatter plots is that both axes should be continuous. Figure 13 shows what a scatter plot looks like.
Fig. 13. Example of a Scatter Plot 
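The direction and strength that a scatter plot shows visually can be summarized numerically. One standard summary, used here as an assumption since the text names none, is the Pearson correlation coefficient: its sign gives the direction and its magnitude the strength of a linear relationship.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

The function assumes that neither variable is constant, since the denominator would then be zero.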
2.7 Box Plot
A box plot is a convenient way of graphically depicting groups of numerical data through their five-number summary:
_ the smallest observation (sample minimum)
_ lower quartile (Q1)
_ median (Q2)
_ upper quartile (Q3) and
_ largest observation.
A box plot is represented as shown in Figure 14:
Fig. 14. Boxplot 
An example of a box plot is shown in Figure 15. In the example we can identify the distributions and tell whether they are normal or skewed. We can also see the outliers.
Box plots are used to visualize variation within an array. The advantage of box plots is that they give some indication of the data's symmetry and skewness. Box plots also help to identify outliers. The drawback of box plots is that they hide many of the details of the distribution. They are also not as visually appealing as other graphs.
Fig. 15. Example of a Boxplot which shows the distribution as well as
the outliers 
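The five numbers can be computed directly from the data. In the sketch below the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one of several conventions, and outliers are flagged with the common 1.5 x IQR rule. Both choices are assumptions, since the text does not say how outliers are determined, and the function expects at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return the five-number summary of the data plus values flagged as outliers."""
    data = sorted(values)
    half = len(data) // 2
    # Quartiles as medians of the lower and upper halves (one common convention).
    q1 = median(data[:half])
    q3 = median(data[-half:])
    iqr = q3 - q1
    # Assumed rule: points more than 1.5 * IQR beyond the quartiles are outliers.
    outliers = [v for v in data if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {"min": data[0], "q1": q1, "median": median(data),
            "q3": q3, "max": data[-1], "outliers": outliers}
```

A median that sits off-center between Q1 and Q3 is the numerical counterpart of the skewness that the box plot shows.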
2.8 Advantages and Disadvantages of Visualization Methods
This section summarizes the advantages and disadvantages of the visualization methods discussed in this section.
Fig. 16. Advantages and Disadvantages of Visualization methods
3 VISUALIZATION TOOLS
Visualization tools allow users to visualize genes with multiple expression profiles. With the help of these tools, users analyze the microarray data according to their requirements. Tools give users the option of changing the visualization method, i.e., they provide more than one type of visualization method to make it easier for the user to understand the gene expression profile. There are many tools that provide these features. Some of the widely used visualization tools are discussed in this section.
3.1 ClusterView and TreeView
ClusterView and TreeView are programs that provide a computational and graphical environment for analyzing data from DNA microarray experiments or other genomic datasets. ClusterView helps the user organize and analyze the data in many different ways. TreeView allows the user to visualize the organized data, which it represents using the heatmap method. ClusterView and TreeView do not provide other methods of visualization. Another disadvantage is that TreeView depends on ClusterView to organize the data, i.e., ClusterView has to analyze and organize the data first; only then is TreeView capable of displaying it. The effect of any error made while organizing the data propagates to TreeView too. An example of ClusterView and TreeView used to visualize the Lupus dataset is shown in Figure 17.
Fig. 17. Clusterview and Treeview on Lupus dataset 
3.2 Hierarchical Clustering Explorer (HCE)
Hierarchical Clustering Explorer (HCE) gives users control over the data analysis process and enables more interaction with the analysis results through interactive visual techniques. Users can perform exploratory data analysis, establish meaningful hypotheses and verify results. HCE applies hierarchical clustering without a predetermined number of clusters and then enables users to determine the natural grouping. HCE provides tools that help users understand and visualize the data:
_ The Overview Tool lets the user see the entire dataset and helps identify high-level patterns and hot spots.
_ The Dynamic Query Tool allows the user to view clusters of varying size and provides the option to view detail at a smaller scale.
HCE enables the user to visualize the data using scatter plots, dendrograms, histograms, heatmaps, parallel coordinates, etc. An example of HCE is shown in Figure 18.
Fig. 18. Hierarchical Clustering Explorer for Lupus Dataset
3.3 TimeSearcher
TimeSearcher uses a different visualization approach, which is based on the idea of parallel coordinates. A microarray dataset is usually composed of the expression levels of genes at different times during an experiment. This makes microarray data multivariate and thus suitable for parallel coordinates. In TimeSearcher, each gene's expression profile is represented by a line. Genes with similar expression profiles are close to each other, as shown in Figure 19. Users can select a particular gene to investigate in detail. The disadvantage is that when genes have similar expression profiles, it is difficult to visualize a particular gene. The other major disadvantage of TimeSearcher is that it does not provide any other visualization methods.
Fig. 19. TimeSearcher
3.4 Spotfire
Spotfire is one of the most widely used software packages for visualizing data. It lets users visualize data using various visualization methods such as scatter plots, dendrograms, parallel coordinates, heatmaps, bar graphs and pie charts. Spotfire supports multiple visualizations of the data within a single window. Spotfire helps users reduce large amounts of data in order to extract information about patterns and relationships and to visualize possible underlying processes. Spotfire provides the user with many interactive functions, such as zooming, defining data ranges and brushing. The most important interaction in Spotfire is the Dynamic Query slider, which interactively filters out results that are not of interest. The major advantage of Spotfire is that it can import data from a number of databases for visualization in a single session. An example of Spotfire is shown in Figure 20.
Fig. 20. Spotfire Software Tool 
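The idea behind a dynamic query can be illustrated independently of any tool: every slider defines a value range for one field, and only the records that fall inside all ranges stay visible. The sketch below is not Spotfire's API; the function name and the record fields used in the usage example are hypothetical.

```python
def dynamic_query(records, ranges):
    """Keep the records whose value for every queried field lies within [low, high].

    records: list of dicts, one per gene.
    ranges: {field name: (low, high)}, one entry per slider.
    """
    return [
        rec for rec in records
        if all(low <= rec[field] <= high for field, (low, high) in ranges.items())
    ]

# Hypothetical usage: two sliders narrowing three made-up records.
records = [
    {"gene": "g1", "fold_change": 2.5, "p_value": 0.01},
    {"gene": "g2", "fold_change": 0.4, "p_value": 0.20},
    {"gene": "g3", "fold_change": 3.1, "p_value": 0.03},
]
visible = dynamic_query(records, {"fold_change": (2.0, 4.0), "p_value": (0.0, 0.02)})
```

In an interactive tool the filter is re-evaluated whenever a slider moves, so the display updates as the query changes.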
3.5 GeneSpring
GeneSpring is an easy-to-use statistical tool that allows users to design revealing analysis protocols and interpret the results. GeneSpring is also an interactive toolkit, similar to Spotfire. GeneSpring, too, has many ways of visualizing the data, such as scatter plots, box plots, dendrograms, pathway diagrams, etc. It provides several interactive options, such as zooming and customizing the visualization by changing the color, range, etc. An advantage of GeneSpring is that it is capable of clustering the data. An example of the GeneSpring software tool is shown in Figure 21.
Fig. 21. GeneSpring Software Tool 
3.6 Comparison of Visualization Tools
In this section, a summary of the visualization tools is given. Figure 22 shows which visualization methods each tool uses.
The efficiency of a tool also depends on how user-friendly it is, i.e., how easily a user can interact with it. The different tools provide different levels of interaction. For example, ClusterView and TreeView provide users with Overview and Detail options. Overview refers to an overview of a cluster of genes or experiments, and Detail refers to information about a particular gene. TimeSearcher also provides these options and, in addition, provides brushing and dynamic query operations. Brushing refers to selecting a subset of the data items. HCE and Spotfire provide Overview and Detail, Dynamic Query, Zooming and Brushing options. GeneSpring provides only zooming and brushing options.
Fig. 22. Comparison of Tools
4 EXPERIMENTS STUDIED
In this section I briefly describe the experiment conducted by Purvi Saraiya et al. The authors tried to measure the insights that users gain from various visualization tools. They define an insight as "an individual's observation about the data". The experiment used three different types of microarray dataset, namely Time Series, Viral Condition and Lupus vs. Control. Thirty participants volunteered to take part in the experiment. All the participants had basic knowledge of microarray technology, and all were given basic instructions about the visualization tools. Participants had to comment on their experience: how they were able to perceive the data and how comfortable they were using the particular tool. The authors could then compute various measures of insight, such as count of insights, total domain value, average final amount learned, etc. Purvi Saraiya et al. provide a comparative study of which tool provides more insight for a particular dataset. The result of the experiment is shown in Figure 23. The graph shows the total count of insights gained, the total insight value and the average total time for each tool.
In the graph, count of insights is the total number of insights obtained for each tool. There was not much variance between Spotfire and GeneSpring in the insights gained. Total domain value is the sum of the values of all the insight occurrences. Average total time is the average time that users spent with the tool until they could not gain any new insight into the data. A lower time indicates that the tool is more efficient.
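The three measures can be computed from per-participant session records as sketched below. The record layout, in which each session lists the value assigned to every insight and the minutes spent, is an assumption made for illustration and is not taken from the study.

```python
from collections import defaultdict
from statistics import mean

def summarize_sessions(sessions):
    """Summarize (tool, insight values, minutes) records per tool."""
    by_tool = defaultdict(list)
    for tool, insight_values, minutes in sessions:
        by_tool[tool].append((insight_values, minutes))
    summary = {}
    for tool, runs in by_tool.items():
        summary[tool] = {
            # Number of insights over all participants who used the tool.
            "count_of_insights": sum(len(values) for values, _ in runs),
            # Sum of the values assigned to those insights.
            "total_domain_value": sum(sum(values) for values, _ in runs),
            # Mean time spent per participant.
            "average_total_time": mean(minutes for _, minutes in runs),
        }
    return summary
```

Any input passed to this function in an example would be made-up data; the actual results of the study are those shown in Figure 23.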
However, the average total time alone does not establish that a tool is more efficient: according to the graph, ClusterView and TreeView take the least time, yet we cannot conclude that they are the most efficient tools. Spotfire and GeneSpring have many ways of visualizing the same data, whereas ClusterView and TreeView can visualize data only as heatmaps or dendrograms. In spite of offering fewer visualization methods, ClusterView and TreeView still required participants to spend some time to gain insight.
Fig. 23. Count of Insights, Total domain value and average total time for each tool
5 CONCLUSION
The experiments and the inferences obtained in the course of this survey have given some important insights into the use of visualization tools. Among the tools explained above, Spotfire and GeneSpring provide the user with the widest variety of visualization methods. They also provide a very good graphical user interface, which helps users interact with the tool easily. The other tools also provide the results that users expect. The effectiveness of a tool also depends on how comfortable a user is when using it. A tool is not effective if it provides a good interface but its visualization is not what the user expects.
The tool to be used depends on the type of dataset the user has and the type of information the user would like to gain. If the researcher just wants to find clusters of genes that behave similarly, then ClusterView and TreeView can be used. If the requirement is more demanding and calls for more detail, then either GeneSpring or Spotfire can be used. Hence the choice of a visualization tool depends on the researcher, the type of dataset and the requirements that have to be met.