
Data Mining Techniques in DNA Microarray Data



  • Nur Muyassarah Mohd Azmin



In this paper, we examine how data mining techniques are used on DNA microarray data, and how data mining helps bioinformaticians obtain results from such data. A framework is a hierarchical directory that encapsulates shared resources, such as a dynamic shared library, nib files, image files, localized strings, header files, and reference documentation, in a single package. Multiple applications can use all of these resources at the same time. The system loads them into memory and shares the one copy of the resource among all applications whenever possible.

  1. Introduction to DNA and proteins

All organisms on Earth, apart from viruses, consist of cells. Paramecium, for example, has one cell, while we humans have trillions of cells. Eukaryotic cells have a nucleus, and inside the nucleus there is DNA, which encodes the “program” for making future organisms. DNA has coding and non-coding segments. The coding segments, called “genes”, specify the structure of proteins, which are large molecules, like haemoglobin, that do the essential work in every organism. Practically all cells within the same organism have identical genes, but genes are expressed at different times and under different conditions. Genes are turned into proteins in two steps: first the DNA is transcribed into messenger RNA (mRNA), which is then translated into protein. The different patterns of gene expression follow carefully tuned biological programs, according to tissue type, developmental stage, environment and genetic background, and account for the huge variety of different cell states and types. Virtually all major differences in cell state or type are correlated with changes in the mRNA levels of many genes.
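The two-step path from gene to protein can be illustrated with a toy sketch. The sequence below is made up for illustration, and the codon table is deliberately partial: it lists only the codons that occur in this example, not the full genetic code.

```python
# Partial codon table: only the codons used in this toy example.
CODON_TABLE = {
    "AUG": "Met",  # methionine, also the start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "AAA": "Lys",
    "UAA": None,   # stop codon
}


def transcribe(coding_strand):
    """Step 1: DNA coding strand -> mRNA (same sequence, T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Step 2: read the mRNA three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon reached
            break
        protein.append(amino_acid)
    return protein


gene = "ATGTTTGGCAAATAA"   # hypothetical coding sequence
mrna = transcribe(gene)    # "AUGUUUGGCAAAUAA"
protein = translate(mrna)  # ["Met", "Phe", "Gly", "Lys"]
print(mrna, protein)
```

A microarray measures the abundance of the intermediate product (the mRNA), not the final protein.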

  1. Microarray

In recent years there has been an explosion in the rate of acquisition of biomedical data. Advances in genetics technologies, such as DNA microarrays, make it possible for the first time to obtain a "global" view of the cell. For instance, we can now routinely investigate the molecular state of a cell by measuring the expression of thousands of genes simultaneously using DNA microarrays. Different types of microarray use different technologies for measuring mRNA expression levels; a detailed description of these technologies is beyond the scope of this paper. Here we focus on the analysis of data from Affymetrix arrays, which are currently among the most popular commercial arrays. The methodology for analysing data from other arrays would be similar, but it would use different technology-specific data preparation and cleaning steps. This type of microarray is a silicon chip that can measure the expression levels of thousands of genes simultaneously. This is done by hybridizing a complex mixture of mRNAs, derived from tissue or cells, to microarrays that display probes for different genes tiled in a grid-like fashion. Hybridization events are detected using a fluorescent dye and a scanner that can detect fluorescence intensities. The scanners and associated software perform various forms of image analysis to measure and report raw gene expression values. This allows a quantitative readout of gene expression on a gene-by-gene basis. As of 2003, there are one-chip microarrays that measure the expression of over thirty thousand genes, covering most of the human genome. Microarrays have opened the possibility of creating data sets of molecular information to represent many systems of biological or clinical interest.

Gene expression profiles can be used as inputs to large-scale data analysis, for example, to serve as fingerprints for building more accurate molecular classifications, to discover hidden taxonomies or to increase our understanding of normal and disease states. The first generation of microarray analysis methodologies, developed over the last five years, has demonstrated that expression data can be used in a variety of class discovery or class prediction biomedical problems, including those relevant to tumour classification. Machine learning and statistical techniques applied to gene expression data have been used to address the questions of distinguishing tumour morphology, predicting post-treatment outcome, and finding molecular markers for disease. Today the microarray-based classification of different morphologies, lineages and cell histologies can be performed successfully in many instances. Performance in predicting treatment outcome or drug response has been more limited, but some of the results are quite promising. Most results of microarray analysis still need further experimental validation and follow-up study, and many current efforts are directed to this end. In a few cases the results of microarray analysis have found their way into more serious consideration for clinical use.

Figure 1: Affymetrix GeneChip (right), its grid (centre) and a cell in a grid (left).

Figure 2: An example raw microarray image for one sample (image courtesy of Affymetrix). The image intensities on the left are translated by microarray software into numbers like those on the right.

  1. Microarray Data Analysis

Microarray data sets are commonly very large, and analytical precision is influenced by a number of variables. It is therefore extremely useful to reduce the data set to those genes that best distinguish between the two cases or classes, for example, normal versus diseased. Such analyses produce a list of genes whose expression is considered to change, known as differentially expressed genes. Identification of differential gene expression is the first task of an in-depth microarray analysis. There are two common methods for in-depth microarray data analysis: clustering and classification. Clustering is an unsupervised approach that groups genes or samples with similar patterns that are characteristic of the group. Classification is supervised learning and is also known as class prediction or discriminant analysis. Generally, classification is a process of “learning from examples”: given a set of pre-classified examples, the classifier learns to assign an unseen test case to one of the classes. Three main types of data analysis are needed for DNA microarray data:

  1. Gene Selection

In data mining terms, this process is called attribute selection; it finds the genes most strongly related to the class.

  1. Classification

This process classifies diseases or predicts outcomes based on gene expression patterns, and also helps identify the best treatment for a given genetic signature.

  1. Clustering

This process finds new biological classes or refines existing ones.

Identification of differentially expressed genes (gene selection)

Differentially expressed genes are genes whose expression levels differ significantly between two groups of experiments. These genes are used to locate potential drug targets and biomarkers. In the early stages, a simple “fold change” approach was used to find differences, under the assumption that changes above some threshold were biologically significant. Later, many statistical methods were used to determine either the expression or the relative expression of a gene from normalized microarray data, including t-tests, modified t-tests, two-sample t-tests, the F-statistic and Bayesian models. For more advanced data sets with multiple classes, Analysis of Variance (ANOVA) techniques were used. Various software packages have been developed and are available to identify changes in expression using the above statistical methods.
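As a minimal sketch of gene selection, the example below combines a fold-change filter with a two-sample t statistic. It is an illustration under stated assumptions rather than a recommended pipeline: the expression values are invented, the t statistic is the unequal-variance (Welch) form, and the two cut-offs are arbitrary. A real analysis would convert the statistic to a p-value and correct for testing thousands of genes at once.

```python
import math
from statistics import mean, variance


def log2_fold_change(group_a, group_b):
    """log2 of the ratio of mean expression in group_b to group_a."""
    return math.log2(mean(group_b) / mean(group_a))


def welch_t(group_a, group_b):
    """Two-sample t statistic that does not assume equal variances."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_b) - mean(group_a)) / standard_error


# Hypothetical normalized expression values: gene -> (normal, diseased).
expression = {
    "gene_A": ([100.0, 110.0, 90.0, 105.0], [410.0, 390.0, 400.0, 420.0]),
    "gene_B": ([200.0, 210.0, 190.0, 205.0], [195.0, 215.0, 200.0, 198.0]),
    "gene_C": ([50.0, 400.0, 60.0, 55.0], [300.0, 280.0, 20.0, 900.0]),
}

FOLD_THRESHOLD = 1.0  # |log2 fold change| >= 1 means at least two-fold
T_THRESHOLD = 3.0     # arbitrary illustrative cut-off, not a p-value

selected = []
for gene, (normal, diseased) in expression.items():
    lfc = log2_fold_change(normal, diseased)
    t = welch_t(normal, diseased)
    print(f"{gene}: log2FC={lfc:.2f}, t={t:.2f}")
    if abs(lfc) >= FOLD_THRESHOLD and abs(t) >= T_THRESHOLD:
        selected.append(gene)

print("selected:", selected)
```

Here gene_C passes the fold-change filter but its replicates are so noisy that the t statistic is small, which is the weakness of fold change alone that motivated the later statistical methods.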


Classification

Classification is also called class prediction, discriminant analysis, or supervised learning. Given a set of pre-classified examples (for example, different types of cancer classes such as AML and ALL), a classifier finds a rule that allows new samples to be assigned to one of the above classes. For a classification task, one needs enough samples to train the rule on one set, known as the training set, and then to test it on an independent set of samples, known as the test set. Classification rules are built using normalized gene expression data as input vectors. A wide range of algorithms can be used for classification, including k Nearest Neighbours (kNN), Artificial Neural Networks, weighted voting and support vector machines (SVM). A promising application of classification is in clinical diagnostics, to identify disease types and subtypes. Popular examples include finding classes of leukaemia (ALL or AML), five classes of tumour (MD classic, MD desmoplastic, PNET, rhabdoid, glioblastoma) and four classes of malignant neoplastic disease.
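The training-set and test-set workflow can be sketched with k Nearest Neighbours, one of the algorithms named above. This is a toy illustration: the three-gene expression vectors are invented, Euclidean distance and k=3 are assumptions, and real studies use thousands of genes after gene selection.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train, query, k=3):
    """Label a query sample by majority vote of its k nearest training samples.

    train is a list of (expression_vector, class_label) pairs.
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical normalized expression of three genes per sample.
training_set = [
    ([8.1, 2.0, 5.5], "ALL"),
    ([7.9, 2.3, 5.1], "ALL"),
    ([8.4, 1.8, 5.8], "ALL"),
    ([3.0, 6.9, 5.2], "AML"),
    ([2.7, 7.2, 5.6], "AML"),
    ([3.3, 6.5, 5.0], "AML"),
]

# Independent samples held back for testing; labels are used only for scoring.
test_set = [
    ([8.0, 2.1, 5.4], "ALL"),
    ([2.9, 7.0, 5.3], "AML"),
]

predictions = [knn_predict(training_set, vector, k=3) for vector, _ in test_set]
correct = sum(p == label for p, (_, label) in zip(predictions, test_set))
accuracy = correct / len(test_set)
print(predictions, accuracy)
```

Keeping the test samples out of the training set is what makes the accuracy an honest estimate of how the rule would perform on new patients.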

Clustering Analysis

Clustering is currently the most popular method used in the first step of gene expression data matrix analysis. It is used for finding co-regulated and functionally related groups of genes. Clustering is particularly interesting in cases where we have complete sets of an organism’s genes. There are three common types of clustering methods: hierarchical clustering, k-means clustering and self-organizing maps. Hierarchical clustering is a commonly used unsupervised technique that builds clusters of genes with similar patterns of expression, and so seeks to build a hierarchy of clusters. This is done by iteratively grouping together genes that are highly correlated in terms of their expression measurements, then continuing the process on the groups themselves. A dendrogram represents all genes as leaves of a large, branching tree. The number and size of expression patterns within a data set can be estimated quickly, although the division of the tree into actual clusters is often performed visually. Hierarchical clustering generally falls into two categories: agglomerative and divisive. Agglomerative clustering is a bottom-up approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. Divisive clustering is a top-down approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.
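The bottom-up (agglomerative) procedure can be sketched in a few lines. The choices below are assumptions made for illustration, not something the text prescribes: distance is 1 minus the Pearson correlation, clusters are compared by average linkage, and the four expression profiles are invented. Profiles must not be constant, since the correlation is undefined for them.

```python
import math


def pearson(u, v):
    """Pearson correlation between two expression profiles."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def average_linkage(c1, c2, profiles):
    """Mean correlation distance over all gene pairs drawn from two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    total = sum(1.0 - pearson(profiles[a], profiles[b]) for a, b in pairs)
    return total / len(pairs)


def agglomerative(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[gene] for gene in profiles]  # every gene starts alone
    merges = []                               # merge history (the dendrogram)
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical expression of four genes across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.5, 4.0, 2.0],
}

clusters, merges = agglomerative(profiles, n_clusters=2)
print(clusters)
```

The recorded merge history is the information a dendrogram displays; stopping at a chosen number of clusters corresponds to cutting the tree at one level.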

Knowledge discovered using microarrays

Classification, clustering and identification of differentially expressed genes are often considered the basic microarray data analysis tasks, and they use gene expression profiles alone. However, gene expression profiles can be linked to other external resources to make new discoveries and generate new knowledge. Some common applications that combine gene expression data with other biomedical information are discussed below:

  1. Identification of transcription factor binding site

The identification of functional elements such as transcription-factor binding sites (TFBS) on a whole-genome level is the next challenge for genome sciences and gene-regulation studies. Transcription factors act as critical molecular switches in gene expression. Because transcription factors play a prominent role in transcription regulation, identifying and characterizing their binding sites is central to annotating genomic regulatory regions and understanding gene-regulatory networks. Numerous groups have explored this problem and discovered known binding sites in the promoter regions of genes that are co-expressed.
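The idea of looking for a known binding site in the promoters of co-expressed genes can be sketched as a simple string scan. Everything concrete here is an assumption for illustration: the promoter sequences are invented, the motif is treated as an exact string, and hit fractions are only compared by eye. Practical tools use position weight matrices and a statistical test for over-representation.

```python
def reverse_complement(seq):
    """Reverse complement, so sites on either DNA strand are found."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def contains_site(promoter, motif):
    """True if the motif occurs on either strand of the promoter."""
    return motif in promoter or reverse_complement(motif) in promoter


def hit_fraction(promoters, motif):
    """Fraction of promoters that contain at least one site."""
    hits = [gene for gene, seq in promoters.items() if contains_site(seq, motif)]
    return len(hits) / len(promoters)


MOTIF = "TGACTCA"  # illustrative consensus site, assumed known in advance

# Hypothetical promoter sequences of one co-expressed cluster.
cluster_promoters = {
    "geneA": "ACGTTGACTCAGGCTA",
    "geneB": "GGCATGAGTCATTACG",  # site on the opposite strand
    "geneC": "TTGACTCACCGATAGC",
}

# Hypothetical promoters of genes outside the cluster, used as background.
background_promoters = {
    "geneX": "ACGTACGTACGTACGT",
    "geneY": "GGGCCCATATATGCGC",
    "geneZ": "CCTGACTCATTGGAAC",
    "geneW": "ATATCGCGATTAGCCA",
}

cluster_fraction = hit_fraction(cluster_promoters, MOTIF)
background_fraction = hit_fraction(background_promoters, MOTIF)
print(cluster_fraction, background_fraction)
```

A site that appears in most promoters of a cluster but rarely in the background is a candidate regulator of that cluster.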

  1. Proteins interaction network and pathway analysis

Protein-protein interactions (PPI) are useful tools for investigating the cellular functions of genes. They are at the core of the entire interactomics system of any living cell. PPI data improve our understanding of diseases and may provide the basis for new therapeutic approaches. Several databases have been developed to store protein interactions, such as the Biomolecular Interaction Network Database (BIND), the Database of Interacting Proteins (DIP), IntAct, STRING and the Molecular Interaction database (MINT). By combining co-expressed as well as interacting genes in the same cluster, many meaningful predictions related to gene functions, evolutionary relationships and pathways can be made. The next promising method for analysing microarray data is pathway analysis, because it involves the cascade of network interactions. Analysing microarray data from a pathway perspective could lead to a higher level of understanding of the system. This integrates the normalized array data with their annotations, such as metabolic pathways, gene ontology and functional classifications. Metabolic pathway analysis can identify more subtle changes in expression than the gene lists that result from univariate statistical analysis.
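Combining co-expression with interaction data can be sketched as a filter over an interaction list. The cluster assignments and interaction pairs below are invented for illustration; in practice the pairs would be downloaded from one of the databases named above, whose file formats are not modelled here.

```python
def coexpressed_interactions(cluster_of, interactions):
    """Keep interaction pairs whose two genes fall in the same expression cluster.

    cluster_of maps gene -> cluster id; interactions is a list of gene pairs.
    Returns a dict mapping cluster id -> list of supported pairs.
    """
    supported = {}
    for a, b in interactions:
        if a in cluster_of and cluster_of[a] == cluster_of.get(b):
            supported.setdefault(cluster_of[a], []).append((a, b))
    return supported


# Hypothetical output of a clustering step.
cluster_of = {"geneA": 1, "geneB": 1, "geneC": 1, "geneD": 2, "geneE": 2}

# Hypothetical protein-protein interaction pairs.
interactions = [
    ("geneA", "geneB"),  # same cluster: kept
    ("geneA", "geneD"),  # different clusters: dropped
    ("geneD", "geneE"),  # same cluster: kept
    ("geneC", "geneF"),  # geneF was not on the array: dropped
]

supported = coexpressed_interactions(cluster_of, interactions)
print(supported)
```

Pairs supported by both kinds of evidence are stronger candidates for a shared function or pathway than pairs supported by either one alone.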

  1. Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether a set of genes shows statistically significant, concordant differences between two biological states. The gene sets are defined based on prior biological knowledge, for example, published information about biochemical pathways, location in the same cytogenetic band, sharing the same gene ontology category, or any user-defined set. The genes are first ranked by how strongly their expression differs between the two classes. The goal of GSEA is to determine whether members of a gene set tend to occur toward the top (or bottom) of this ranked list, in which case the gene set is correlated with the phenotypic class distinction.
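The "top or bottom of the list" idea can be made concrete with a running-sum statistic. The sketch below is a simplified, unweighted Kolmogorov-Smirnov-style score and should not be read as the full published method, which also weights genes by their correlation with the phenotype and assesses significance by permutation. The ranked list and the gene sets are invented.

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up at set members and down otherwise.

    Returns the running-sum value with the largest absolute deviation from
    zero: positive if the set clusters near the top, negative near the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical list of ten genes, ranked from most up-regulated in class 1
# (g1) to most up-regulated in class 2 (g10).
ranked_genes = [f"g{i}" for i in range(1, 11)]

es_top = enrichment_score(ranked_genes, {"g1", "g2", "g4"})
es_bottom = enrichment_score(ranked_genes, {"g8", "g9", "g10"})
es_scattered = enrichment_score(ranked_genes, {"g2", "g5", "g9"})
print(es_top, es_bottom, es_scattered)
```

A set concentrated at either end of the list gives a score far from zero, while a set spread evenly through the list stays close to zero.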

  1. Summary

Microarrays are a revolutionary technology with great potential to provide accurate medical diagnostics, to help find the right treatment and cure for many diseases, and to provide a detailed genome-wide molecular portrait of cellular states. Microarray experiments produce considerably more data than other techniques. Integrating gene expression data with other biomedical resources can provide new mechanistic or biological hypotheses. However, innovative statistical techniques and computing software are essential for the successful analysis of microarray data. This review describes the current bioinformatics tools and the promising applications for analysing data from microarray experiments. The various data analysis perspectives and software discussed in the paper can give biological experts a good foundation for the computational analysis of microarray data.

