Visualization Of Microarray Data Biology Essay



Abstract: Visualization plays a vital role in molecular biology, where biologists are interested in viewing the gene patterns of RNA and DNA. Microarray experiments for detecting these patterns have produced large gene expression datasets, and visualizing these datasets is the main goal of this survey. Although many tools are available for visualizing such datasets, it can be difficult for a biologist to choose an appropriate one. Biologists need different views of a dataset, but a single tool can rarely provide every possible insight into the gene expression it contains. Biologists prefer a tool that provides the required information simply and in the least possible time. This survey studies some of the visualization tools that can help biologists visualize the data obtained from microarray experiments.


Microarrays are a high-throughput technology used to measure the expression levels of genes. They are a popular gene expression profiling tool because many genes can be analyzed in a single experiment. Microarray technology is advancing, with numerous applications in gene expression, genotyping, and cell biology. It allows researchers to analyze genes and compare their behavior, and it helps them learn more about diseases such as heart disease, infectious disease, mental illness and many more. It also helps them find interactions and relationships among genes. Before the development of microarray technology, scientists classified diseases such as cancer based on the organs in which tumors develop. With microarrays, scientists are able to classify such diseases based on gene activity in the tumor cells. The major steps involved in a microarray experiment are:


_ Preparation of microarrays.

_ Preparation of fluorescently labeled cDNA probes and hybridization.

_ Scanning the microarray, followed by image and data analysis (see Figure 1).

A microarray consists of sequences of deoxyribonucleic acid (DNA), proteins, or tissue samples arranged in the form of an array for ease of analysis.

Microarrays are mainly used for:

_ Gene expression studies

_ Disease diagnosis

_ Pharmacogenetics (drug discovery)

_ Toxicogenomics (study of how the genome responds to toxic substances)

The central principle of the microarray technique is the selective binding of complementary single-stranded nucleic acid sequences (hybridization) and the use of fluorescent dyes (Cy3 and Cy5) to visualize differences in complementary DNA (cDNA) levels, which represent mRNA levels [19], [20], [21]. Microarrays are created by robotic machines that place the genes on a glass slide or a silicon chip. The process of building a microarray chip is shown in Figure 1 [18].

Microarrays can be broadly classified into two types:

1. cDNA array (spotted array)

2. High-density oligonucleotide array (Affymetrix)

The two types of microarray are shown in Figure 2.

Fig. 1. Process of building a Microarray Chip [18].

Fig. 2. a) Affymetrix Chip b) Spotted Array Chip [18].

Manuscript received 31 March 2010; accepted 1 August 2010; posted online 24 October 2010; mailed on 16 October 2010.

In an Affymetrix chip, each gene is represented by around 16-20 pairs of probes synthesized on the chip. Each probe pair consists of two oligonucleotides:

_ Perfect match (PM, reference sequence): ATG...C...TGC (around 20-25 bases)

_ Mismatch (MM, one base changed): ATG...T...TGC

The major difference between the two types of microarray is that a cDNA (spotted) array uses two different dyes, Cy3 and Cy5, to label the RNA, whereas an Affymetrix array uses only one dye.

On a spotted array, if the RNA from the healthy sample is more abundant, the spot will be green; if the RNA from the tumor sample is more abundant, it will be red. If both are equally abundant, the spot will be yellow, and if neither is present the spot will be black. On a prepared microarray slide, each spot contains multiple identical strands of DNA, and the DNA sequence in each spot is unique. The microarray is then scanned to extract the gene information. Scanning is done twice, once to measure Cy5 fluorescence and once to measure Cy3 fluorescence. This process is shown in Figure 3.

Fig. 3. Scanning process of Microarray [23]
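The color rule for a two-dye spot can be sketched in a few lines of Python. This is only an illustration: it assumes the healthy sample is labeled with Cy3 (green) and the tumor sample with Cy5 (red), and the function names and the thresholds `detect` and `fold` are hypothetical choices, not values taken from any scanner or from the cited references.

```python
import math

def log_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3); positive means more tumor (red) signal."""
    return math.log2(cy5 / cy3)

def spot_color(cy5, cy3, detect=100.0, fold=2.0):
    """Classify a spot as 'black', 'red', 'green' or 'yellow'.

    detect and fold are illustrative thresholds (assumptions).
    """
    if cy5 < detect and cy3 < detect:
        return "black"                  # neither sample is detectably present
    ratio = (cy5 + 1.0) / (cy3 + 1.0)   # +1 avoids division by zero
    if ratio >= fold:
        return "red"                    # tumor sample (Cy5) dominates
    if ratio <= 1.0 / fold:
        return "green"                  # healthy sample (Cy3) dominates
    return "yellow"                     # roughly equal abundance
```

In practice the log ratio is usually preferred over the raw ratio because over- and under-expression of the same magnitude become symmetric around zero.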

The raw data obtained from the microarray slide are digital images. To obtain information about gene expression levels, these images are analyzed: each spot on the array is identified, and its intensity is measured and compared with values representing the background. Image quantitation is usually carried out using image analysis software. The three phases of processing scanned images are:

1. Addressing or Gridding

_ Assigning coordinates to each of the spots.

2. Segmentation

_ Classification of pixels either as foreground or as background.

3. Intensity extraction (for each spot)

_ Foreground fluorescence intensity pairs (R, G).

_ Background intensities.

_ Quality measures.

During hybridization, some labeled material attaches to the array non-specifically, even where no complementary DNA is available. The resulting signal is known as background intensity. An essential feature of all image analysis software is that the digital microarray images are processed and the extracted data are combined in a table, known as a spot quantitation matrix. Each row corresponds to one spot on the array, and each column represents a different quantitation characteristic of the spot, such as the median or mean pixel intensity of the spot and of the local background.
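As a minimal sketch of how one row of such a table might be computed for a single channel, the Python function below takes the pixels already classified as foreground and background for one spot. The function name, the dictionary keys and the choice of median subtraction with clamping at zero are assumptions made for illustration; real image analysis packages offer several background correction strategies.

```python
from statistics import median

def quantify_spot(foreground_pixels, background_pixels):
    """Return one row of a spot quantitation table for a single channel."""
    fg = median(foreground_pixels)
    bg = median(background_pixels)
    return {
        "fg_median": fg,
        "bg_median": bg,
        # Clamp at zero so a noisy background never yields a negative signal.
        "corrected": max(fg - bg, 0),
    }
```

For example, a spot with foreground pixels around 1000 and local background around 100 would get a corrected intensity of about 900.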


The data from multiple hybridizations must be further transformed and organized into a gene expression matrix, as shown in Figure 4. In this matrix, each row represents a gene and each column represents an experimental condition, such as a particular biological sample. Each position in the matrix characterizes the expression level of a particular gene under a particular experimental condition. Building such matrices helps us understand gene regulation, metabolic and signaling pathways, the genetic mechanisms of disease, and the response to drug treatments. For instance, if over-expression of certain genes is correlated with a certain cancer, we can explore which other conditions affect the expression of these genes and which other genes have similar expression profiles [29]. Obtaining a gene expression matrix that combines information from many spot quantitation matrices is not easy, because a single gene can be represented by several features on the array, containing the same or different DNA sequences. In addition, the same experimental condition may be measured in multiple hybridizations carried out over replicate experiments.

Fig. 4. Gene Quantization, Gene Expression Matrix [25]
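One simple way to combine replicate features and replicate hybridizations into a single matrix is to average all values that belong to the same gene and condition. The Python sketch below assumes the measurements arrive as (gene, condition, value) tuples; the function name and the use of the mean are illustrative assumptions, and other summaries such as the median are equally possible.

```python
from collections import defaultdict
from statistics import mean

def build_expression_matrix(measurements):
    """measurements: iterable of (gene, condition, value) tuples.

    Returns (genes, conditions, matrix) where matrix[i][j] is the mean of all
    values for genes[i] under conditions[j], or None if nothing was measured.
    """
    cells = defaultdict(list)
    for gene, condition, value in measurements:
        cells[(gene, condition)].append(value)
    genes = sorted({g for g, _ in cells})
    conditions = sorted({c for _, c in cells})
    matrix = [
        [mean(cells[(g, c)]) if (g, c) in cells else None for c in conditions]
        for g in genes
    ]
    return genes, conditions, matrix
```

Each row of the returned matrix then corresponds to one gene and each column to one experimental condition, as in the description of the gene expression matrix.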

Gene expression data have proven to be highly informative of disease state, particularly in oncology (the study of tumors), where accurate and early diagnosis can be critical [30]. These data are carefully recorded and stored in databases, where they can be queried, compared and analyzed using different software programs. A gene expression database mainly consists of:

_ The gene expression matrix,

_ Sample Annotation and

_ Gene Annotations.

Microarray data are meaningful only in the context of the particular biological sample and the exact conditions under which the sample was taken; this information is captured by the sample annotation. For example, if we are interested in finding out how different genes react to treatments with various chemical compounds, the compounds used in the experiments should be specified in the sample annotation. Gene annotation may include the gene name, sequence information, location in the genome, the functional role of a known gene, and so on. A matrix combined with these annotations is called an annotated gene expression data matrix. A sample annotated gene expression matrix is shown in Figure 5.

There is no standard way of measuring gene expression levels. Hence every experiment should be stored in the database together with the details of how the gene expression data matrix was obtained. The Microarray Gene Expression Data (MGED) Society provides guidelines for the Minimum Information About a Microarray Experiment (MIAME), which attempt to define the set of information sufficient to interpret an experiment and its results. According to MIAME, the description of a gene expression experiment may include the following [31]:

_ Experimental design - experiment type, authors, experimental

factors and citations;

_ Samples used (initial characteristics and treatment history), extract

preparation and labelling, including laboratory protocols;

_ Hybridization procedures and parameters; and

_ Measurement data and specifications of data processing - raw

data, image properties, normalized and summarized data.

As shown in Figure 6, the gene annotation describes the features of a particular gene, and the sample annotation describes the conditions under which that gene was tested. The remaining values specify the expression level of the gene. The column and row headers of the gene annotation and sample annotation differ from experiment to experiment, because different types of experiment may involve different conditions and different gene features.

Fig. 5. Conceptual view of Gene Expression matrix [30].

Gene expression data analysis can be classified as either:

_ Supervised or

_ Unsupervised.

In supervised analysis, we use both the gene annotation and the sample annotation from the beginning of the analysis. An example of supervised analysis is sample classification. Here we use the sample annotation to split the set of samples into classes, for example healthy and diseased tissues, and try to find features in the expression data that discriminate between these classes [30].

In unsupervised analysis, we do not consider the annotations while grouping the data. Examples of such analysis are gene clustering (finding sets of genes with similar expression patterns) and sample clustering (finding samples that are similar in terms of similarly expressed genes). The annotation is taken into consideration afterwards, to check whether a cluster of similarly expressed genes contains genes with similar functional roles [30].

Visualization is a powerful data mining technique for finding patterns in data, and it is widely used for gene expression data analysis. Since gene expression matrices are high dimensional, visualization can be used in combination with dimensionality reduction techniques such as clustering or principal component analysis. Visualization helps biologists gain greater insight into the data. The dataset first has to be normalized so that noise (uninteresting detail) can be reduced. Normalization is required to correct measurement errors and bias in the observed data. Errors and bias may be introduced during the hybridization process, by noise in the scanner, or by environmental conditions.
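The text does not name a specific normalization method. As one common and very simple possibility, the Python sketch below applies global median centering to the log2 ratios of a two-dye array, which removes a constant dye bias under the assumption that most genes are not differentially expressed. The function name is hypothetical and this is not presented as the method used by any of the tools surveyed here.

```python
import math
from statistics import median

def median_center(cy5_values, cy3_values):
    """Return log2(Cy5/Cy3) ratios shifted so that their median is zero."""
    ratios = [math.log2(r / g) for r, g in zip(cy5_values, cy3_values)]
    shift = median(ratios)          # estimate of the global dye bias
    return [m - shift for m in ratios]
```

After centering, a value of zero means "typical for this array" rather than "equal raw intensity in both channels".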

Many visualization tools are available, and their capabilities overlap, so biologists should select the tool that best helps them gain these insights. We discuss some popular visualization tools such as Spotfire, GeneSpring, TimeSearcher and ClusterView. Using these tools, biologists can visualize the data according to their needs. We try to understand which tools are easy to use and provide the expected results. Each visualization tool has its own advantages and disadvantages. One measure of an effective visualization is its ability to generate unpredicted new insights, beyond predefined data analysis tasks [1]. These tools must help biologists find the answers they are searching for and must also help them investigate further.

Fig. 6. Sample Data Set


Visualization methods are primarily used to gain biological insight into the data generated by microarrays. As shown in Figure 6, each row represents a particular gene and each column represents an experiment conducted on that gene. With this information, it is easier for the biologist to answer questions such as:

_ Which genes have similar profiles?

_ What are the features of genes with similar profiles?

_ What is the functional behavior of a particular gene?

_ Which genes are involved in a particular biological process?

_ Which genes are the key elements in a biological process?


Biologists use different visualization methods to answer these questions. For example, if a biologist is interested in the features of a particular gene, a heatmap is one method that can provide this information.

The dataset obtained after image processing is very large, and it is difficult to interpret what each value in it means. Visualization methods are used to understand the dataset and to gain insight into what these values signify. There are various approaches to visualizing microarray data, ranging from viewing the raw image data to viewing the profiles of genes across experiments. In this section we illustrate common visualization methods used to visualize microarray data.


2.1 Heatmap

Heatmaps are the most popular method used to visualize microarray data. A heatmap is a type of plot in which pivoted (short/wide) data are presented as a matrix of rows and columns, where the cells are of equal size and the information represented by the color of each cell is the most important property [2]. A heatmap represents genes along the rows and experiments along the columns. Each cell of the matrix is colored according to the expression value it represents. A sample heatmap is shown in Figure 8. Heatmaps are usually used when biologists need to find the behavior of a particular gene across different experiments or conditions. Heatmaps also help to find clusters of genes, displayed as areas of similar color, that behave similarly across a set of experiments.

Fig. 7. Various ways of representing the data

Fig. 8. Example of Heatmap [23].
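The core of a heatmap is the mapping from an expression value to a cell color. The following Python sketch assumes the values are log ratios and uses a green-black-red scale, which is a common convention for two-dye data; the function names and the clipping limit of 2.0 are illustrative assumptions.

```python
def heat_color(value, limit=2.0):
    """Map a log ratio to an (R, G, B) tuple on a green-black-red scale."""
    clipped = max(-limit, min(limit, value))      # saturate extreme values
    level = round(255 * abs(clipped) / limit)     # brightness of the cell
    return (level, 0, 0) if clipped > 0 else (0, level, 0)

def heatmap_colors(matrix, limit=2.0):
    """Convert a gene-by-experiment matrix into a matrix of cell colors."""
    return [[heat_color(v, limit) for v in row] for row in matrix]
```

Clipping keeps a few extreme values from compressing the color range of all other cells.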

2.2 Treemaps

Treemaps display data as a set of nested rectangles. Each branch of the tree is represented as a rectangle, which is further tiled with smaller rectangles that represent its sub-branches. This leads to a collection of embedded rectangular bounding boxes, which readily shows the hierarchical structure of the information space [23]. The parent-child relationship is indicated by enclosing each child rectangle within the corresponding parent rectangle. The advantage of treemaps is that they make efficient use of space; as a result they can display thousands of items on the screen simultaneously. The disadvantages are that the lack of edges linking the nodes can make the hierarchical structure of the dataset harder to understand, and that the computational overhead is higher than for classical tree drawing algorithms. An example treemap is shown in Figure 9.
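The nesting of rectangles can be illustrated with the simple slice-and-dice layout, which alternates the split direction at each level of the tree. The Python sketch below assumes a tree given as (name, size) leaves and (name, [children]) branches; this representation and the function names are hypothetical, and practical tools often use more elaborate layouts such as squarified treemaps.

```python
def weight(node):
    """Total size of a leaf or of all leaves below a branch."""
    _, content = node
    if isinstance(content, list):
        return sum(weight(child) for child in content)
    return content

def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out a tree as nested rectangles.

    Returns a list of (name, x, y, w, h) tuples, parents before children.
    """
    if out is None:
        out = []
    name, content = node
    out.append((name, x, y, w, h))
    if isinstance(content, list):
        total = sum(weight(child) for child in content)
        offset = 0.0
        for child in content:
            share = weight(child) / total
            if depth % 2 == 0:      # even depth: children side by side
                slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
            else:                   # odd depth: children stacked vertically
                slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
            offset += share
    return out
```

Every child rectangle lies inside its parent, and its area is proportional to its weight, which is how the parent-child relationship is conveyed without drawing edges.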

2.3 Parallel Coordinates

Parallel coordinates is a multidimensional visualization method that helps to represent, explore, evaluate and analyze a large volume of data. Parallel coordinates can display many dimensions at once, each as its own axis, so users can readily identify relationships between the genes and the experiments conducted on them. A sample parallel coordinates visualization is shown in Figure 10.

Fig. 9. Tree map [23].

Fig. 10. Parallel Coordinates to represent microarray data [23]
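In a parallel coordinates plot each experiment becomes a vertical axis and each gene becomes a polyline crossing those axes. A minimal Python sketch of the coordinate computation is given below; it assumes every axis is scaled independently to the range 0 to 1 (min-max scaling), which is a common but not the only choice, and the function name is hypothetical.

```python
def to_polylines(matrix):
    """Return, for each row, a list of (axis_index, height) points in [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    lines = []
    for row in matrix:
        line = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            # A constant column has no spread; draw it at mid-height.
            height = (value - lows[j]) / span if span else 0.5
            line.append((j, height))
        lines.append(line)
    return lines
```

Genes with similar profiles produce polylines that run close together, which is what makes clusters visible in this kind of plot.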

2.4 Principal Components Analysis

Principal Components Analysis (PCA) is an exploratory multivariate statistical technique for simplifying complex data sets [4]. When there are m genes and n experiments, the goal of PCA is to reduce the dimensionality of the data matrix by finding r new variables (the principal components), where r is less than m. PCA can also be applied as a data preprocessing step before clustering. PCA can be used on a time series dataset, where the behavior of the genes over time can be observed. PCA tries to reduce the dimensions of the data so as to summarize the most important part while ignoring the noise [23]. A sample PCA is shown in Figure 11 [23]. The major disadvantage of PCA is that it cannot capture nonlinear structures consisting of arbitrary clusters.
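To make the idea concrete, the Python sketch below finds only the first principal component, using power iteration on the sample covariance matrix. It assumes one observation per row and one variable per column; the function name, the start vector and the iteration count are illustrative assumptions, and a real analysis would normally rely on a numerical library that computes all components at once.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Uneven start vector, to reduce the chance of starting exactly
    # orthogonal to the leading direction.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:     # no variance along v; give up on iterating
            break
        v = [x / norm for x in w]
    # Project each centered observation onto the direction found.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

The scores are the coordinates of the observations along the new variable; plotting the scores of the first two or three components gives the kind of picture usually shown as a PCA plot.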

2.5 Dendrogram

A dendrogram is a visual representation of the spot correlation data. The individual spots are arranged along the bottom of the dendrogram and are called leaf nodes. Spot clusters are formed by joining individual spots or existing clusters, and each join point is called a node. Dendrograms are used extensively for clustering, which describes how a given dataset can be divided or grouped into smaller, related subsets in order to extract information. Dendrograms are used to visualize the nested sequence of clusters resulting from hierarchical clustering. The main advantage of dendrograms is their ease of interpretation. A sample dendrogram is shown in Figure 12.

Fig. 11. Principal Component Analysis [23]

Fig. 12. Sample Dendrogram [12]
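The nested sequence of clusters that a dendrogram draws can be produced by agglomerative hierarchical clustering. The Python sketch below is a deliberately naive version that assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members); both choices and the function name are assumptions for illustration, since correlation-based distances and average linkage are also widely used for expression data.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the merges in the order they happen.

    Each merge is (members_a, members_b, distance), where the members are
    tuples of indices into points.
    """
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    clusters = [(i,) for i in range(len(points))]   # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```

Each recorded merge corresponds to one node of the dendrogram, and the merge distance gives the height at which that node is drawn.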

2.6 Scatter Plot

Scatter plots show the relationship between two variables of the data points in a two-dimensional graph. Scatter plots are mainly used to map similarities between genes and help biologists find clusters, outliers and correlations in the data. Scatter plots are useful when there are a large number of data points. When working with a single data set, it is possible to look at the expression behavior of a particular gene [26]. Scatter plots reveal the direction (positive or negative) and the strength of the relationship between the two variables. An advantage of the scatter plot is that it retains exact data values and sample size. It also shows outliers clearly. The disadvantage of scatter plots is that they show only two variables at a time, so data involving hundreds to thousands of time points cannot be analyzed with a single scatter plot. A further limitation is that both axes should be continuous. Figure 13 shows what a scatter plot looks like.

Fig. 13. Example of a Scatter Plot [12]
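The direction and strength of the relationship seen in a scatter plot are commonly summarized by the Pearson correlation coefficient. The short Python function below is a plain implementation of that standard formula, added here only as an illustration; the function name is arbitrary.

```python
def pearson(xs, ys):
    """Pearson correlation: +1 rising line, -1 falling line, 0 no linear trend."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Applied to the expression values of two genes across the same experiments, a value near +1 indicates that the two genes tend to rise and fall together.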

2.7 Box Plot

A box plot is a convenient way of graphically depicting groups of numerical data through their five-number summaries [22]:

_ the smallest observation (sample minimum)

_ lower quartile (Q1)

_ median (Q2)

_ upper quartile (Q3) and

_ largest observation.

The structure of a box plot is shown in Figure 14.

Fig. 14. Boxplot [13]

An example of a box plot is shown in Figure 15. From the example we can identify the distributions and tell whether they are symmetric or skewed. We can also see the outliers.

Box plots are used to visualize variation within an array. The advantage of box plots is that they give some indication of the data's symmetry and skewness. Box plots also help to identify outliers. The drawback of box plots is that they hide many details of the distribution. They are also not as visually appealing as other graphs.

Fig. 15. Example of a Boxplot which shows the distribution as well as

the outliers [13]
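The numbers behind a box plot can be computed directly with the Python standard library. The sketch below uses `statistics.quantiles` with its default method, and flags outliers with the widely used rule of 1.5 times the interquartile range; that rule and the function names are assumptions for illustration, since the text does not say how the plotted outliers were defined.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4)   # the three quartile cut points
    return data[0], q1, q2, q3, data[-1]

def outliers(values, k=1.5):
    """Values further than k interquartile ranges outside the box."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]
```

Different quartile conventions give slightly different Q1 and Q3 for small samples, so results may differ a little from those of a given plotting tool.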

2.8 Advantages and Disadvantages of Visualization Methods

This section summarizes the advantages and disadvantages of the visualization methods discussed above.

Fig. 16. Advantages and Disadvantages of Visualization methods


Visualization tools allow users to visualize genes with multiple expression profiles. With the help of these tools, users can analyze microarray data according to their requirements. Most tools offer more than one type of visualization method and let users switch between them, which makes the gene expression profiles easier to understand. There are many tools that provide these features. Some of the widely used visualization tools are discussed in this section.

3.1 Clusterview and Treeview

ClusterView and TreeView are programs that provide a computational and graphical environment for analyzing data from DNA microarray experiments or other genomic datasets. ClusterView helps the user organize and analyze the data in many different ways, and TreeView allows the user to visualize the organized data. TreeView represents the data visually using the heatmap method; ClusterView and TreeView do not provide other methods of visualization. Another disadvantage is that TreeView depends on ClusterView to organize the data: ClusterView must first analyze and organize the data, and only then can TreeView display it. Any error made while organizing the data therefore propagates to TreeView. An example of ClusterView and TreeView used to visualize the Lupus dataset is shown in Figure 17 [1].

Fig. 17. Clusterview and Treeview on Lupus dataset [28]

3.2 Hierarchical Clustering Explorer (HCE)

Hierarchical Clustering Explorer (HCE) gives users control over the data analysis process and enables more interaction with the analysis results through interactive visual techniques. Users can perform exploratory data analysis, establish meaningful hypotheses and verify results. HCE applies hierarchical clustering without a predetermined number of clusters and then lets users determine the natural grouping. HCE provides tools that help users understand and visualize the data [24]:

_ The Overview tool lets the user see the entire dataset and helps identify high-level patterns and hot spots.

_ The Dynamic Query tool allows the user to view clusters of varying size and provides the option to view details at a smaller scale.

HCE enables the user to visualize the data using scatter plots, dendrograms, histograms, heatmaps, parallel coordinates and so on. An example of HCE is shown in Figure 18 [1].

3.3 TimeSearcher

TimeSearcher uses a different visualization approach, which is based on the idea of parallel coordinates. A microarray dataset is usually composed of the expression levels of genes at different times during an experiment. This makes the microarray data multivariate, and thus suitable for parallel coordinates. In TimeSearcher, the expression profile of each gene is represented by a line. Genes with similar expression profiles lie close to each other, as shown in Figure 19 [28]. Users can select a particular gene to investigate in detail. The disadvantage is that when many genes have similar expression profiles, it is difficult to distinguish a particular gene. The other major disadvantage of TimeSearcher is that it does not provide any other visualization methods.

Fig. 18. Hierarchical Clustering Explorer for Lupus Dataset [1]


Fig. 19. TimeSearcher [1]

3.4 Spotfire

Spotfire is one of the most widely used software packages for visualizing data. It lets users visualize data using a variety of methods such as scatter plots, dendrograms, parallel coordinates, heatmaps, bar graphs and pie charts. Spotfire supports multiple visualizations of the data within a single window. It helps users reduce large amounts of data, extract information about patterns and relationships, and visualize possible underlying processes. Spotfire provides a good deal of interactive functionality, such as zooming, defining data ranges and brushing. The most important interaction in Spotfire is the Dynamic Query slider, which interactively filters out results that are not of interest. The major advantage of Spotfire is that it can import data from a number of databases for visualization in a single session. An example of Spotfire is shown in Figure 20 [1].

Fig. 20. Spotfire Software Tool [2]

3.5 GeneSpring

GeneSpring is an easy-to-use statistical tool that allows users to design revealing analysis protocols and interpret the results. GeneSpring is also an interactive toolkit, similar to Spotfire. Like Spotfire, GeneSpring offers many ways of visualizing the data, such as scatter plots, box plots, dendrograms and pathway diagrams. It provides several interactive options such as zooming and customizing a visualization by changing its colors, ranges and so on. An advantage of GeneSpring is that it is capable of clustering the data. An example of the GeneSpring software tool is shown in Figure 21 [1].

Fig. 21. GeneSpring Software Tool [3]

3.6 Comparison of Visualization Tools

This section summarizes the visualization tools. Figure 22 shows which tool uses which visualization methods to visualize data.

The efficiency of a tool also depends on how user-friendly it is, i.e., how easily a user can interact with it. Different tools provide different levels of interaction. For example, ClusterView and TreeView provide users with an Overview and Detail option. Overview refers to an overview of a cluster of genes or experiments, and Detail provides information about a particular gene. TimeSearcher also provides this option and additionally provides brushing and dynamic query operations. Brushing refers to selecting a subset of the data items. HCE and Spotfire provide Overview and Detail, dynamic query, zooming and brushing options. GeneSpring provides only zooming and brushing options.

Fig. 22. Comparison of Tools


In this section I briefly describe the experiment conducted by Purvi Saraiya et al. [1]. The authors tried to measure the insights that users gain from the various visualization tools. In [1], insight is defined as "an individual's observation about the data". The experiment used three different types of microarray dataset, namely Time Series, Viral condition and Lupus vs Control. Thirty participants volunteered to take part in the experiment. All the participants had basic knowledge of microarray technology, and all were given basic instructions about the visualization tools. Participants had to comment on their experience, describing how well they were able to perceive the data and their comfort level in using the particular tool. The authors could then compute various measures of insight, such as count of insights, total domain value and average final amount learned. Purvi Saraiya et al. provide a comparative study of which tool provides more insight for a particular dataset. The result of the experiment is shown in Figure 23. The graph shows the total count of insights gained, the total insight value, and the average total time for each tool.

In the graph, count of insights [1] is the total number of insights obtained for each tool. There was not much variance between Spotfire and GeneSpring in the insights gained. Total domain value is the sum of the domain values of all the insight occurrences [1]. Average total time is the average time that participants spent using the tool until they could not gain any new insight into the data [1]. A lower time might suggest that the tool is more efficient.

However, average total time alone does not show that a tool is more efficient: according to the graph, ClusterView and TreeView take the least time, but we cannot conclude that they are the most efficient tools. This is because Spotfire and GeneSpring offer many ways of visualizing the same data, whereas ClusterView and TreeView can visualize data only as heatmaps or dendrograms. Despite ClusterView and TreeView offering fewer visualization methods, participants still had to spend some time with them to gain insight.


The experiments and the inferences drawn in this survey give some important insights into the use of visualization tools. Among the tools explained above, Spotfire and GeneSpring provide the user with the widest range of visualization methods. They also provide a very good graphical user interface, which helps users interact with the tool easily. The other tools also provide the results the user expects. The effectiveness of a tool also depends on how comfortable the user is with it. A tool is not effective if it provides a good interface but its visualizations are not what the user expects.

Fig. 23. Count of Insights, Total domain value and average total time for each tool [1]

The tool to be used depends on the type of dataset the user has and the type of information the user would like to gain. If the researcher just wants to find clusters of genes with similar behavior, ClusterView and TreeView can be used. If the requirement is more demanding and calls for more detail, either GeneSpring or Spotfire can be used. Hence the visualization tool should be selected according to the researcher, the type of dataset and the requirements that have to be met.