Knowledge Discovery Technique In Bioinformatics Computer Science Essay


Computers have brought tremendous improvements in technology, especially in processing speed and reduced data storage cost, which have led to the creation of huge volumes of data. Data itself has no value unless it is turned into useful information. Over the past two decades, data mining was developed to generate knowledge from databases. The bioinformatics field has since created many rapidly growing databases, and their content is no longer restricted to numeric or character data. Database management systems allow the integration of various high-dimensional multimedia data under the same umbrella in different areas of bioinformatics.

WEKA includes several machine learning algorithms for data mining. It provides a general-purpose environment with tools for data pre-processing, regression, classification, association rules, clustering, feature selection and visualization. It also contains an extensive collection of data pre-processing methods and machine learning algorithms, complemented by graphical user interfaces for data exploration and for the experimental comparison of different machine learning techniques on the same problem. The main objectives of WEKA are to (a) extract useful information from data and (b) make it easy to identify a suitable algorithm for generating an accurate predictive model from it.

This paper presents short notes on data mining, the basic principles of data mining techniques and a comparison of classification techniques using WEKA.


Computers have brought tremendous improvements in technology, especially in processing speed and data storage cost, which have led to the creation of huge volumes of data. Data itself has no value unless it can be turned into useful information. Over the past two decades, data mining was developed to generate knowledge from databases. Data mining is the process of finding patterns, associations or correlations among data and presenting them as useful information or knowledge [1]. The advancement of healthcare database management systems has created a huge number of databases. Developing knowledge discovery methods and managing large amounts of heterogeneous data have become major research priorities. Data mining remains a promising and rich field of scientific study. It is concerned with making sense of large amounts of mostly unsupervised data in some domain [2].

Data mining techniques

Data mining techniques are either unsupervised or supervised.

An unsupervised learning technique is not guided by a target variable or class label and does not create a model or hypothesis before analysis. The algorithm is applied directly to the data and the results are observed; a model is then built from those results. A common unsupervised technique is clustering.

In supervised learning, a model is specified prior to the analysis, and the algorithm is applied to the data to estimate the parameters of that model. The objective of building supervised learning models is to predict an outcome or category of interest. The biomedical literature on applications of supervised learning techniques is vast. Common techniques used in medical and clinical research are classification, statistical regression and association rules (the last of which is usually regarded as unsupervised). These learning techniques are briefly described below.


Clustering

Clustering is an active field of research in data mining. It is an unsupervised learning technique: the process of partitioning a set of data objects into meaningful subclasses called clusters, revealing natural groupings in the data. A cluster is a group of data objects that are similar to one another but dissimilar to the objects in other clusters. Clustering algorithms can be categorized into partitioning, hierarchical, density-based and model-based methods. Because there are no predefined classes, clustering is also called unsupervised classification.
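As an illustration of the partitioning family of methods, the following is a minimal Python sketch of k-means clustering (Lloyd's algorithm). The function name kmeans, the toy points and the starting centroids are assumptions made for this example; they do not come from the essay or from WEKA.

```python
def kmeans(points, centroids, max_iterations=100):
    """Partition 2-D points into len(centroids) clusters (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its members.
        updated = []
        for centroid, members in zip(centroids, clusters):
            if members:
                updated.append((
                    sum(px for px, _ in members) / len(members),
                    sum(py for _, py in members) / len(members),
                ))
            else:
                updated.append(centroid)  # an empty cluster keeps its centroid
        if updated == centroids:  # nothing moved, so assignments are stable
            break
        centroids = updated
    return centroids, clusters


# Hypothetical toy data: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

The starting centroids are passed in explicitly so that the result is reproducible; practical implementations usually choose them at random and may run several restarts.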

Association Rule

Association rule mining finds relationships among items in a database. A rule is stated in the form X => Y, where X and Y are sets of attributes, and implies that transactions that contain X also tend to contain Y.

Association rules do not represent any sort of causality or correlation between the two item sets.

X => Y does not mean that X causes Y, so there is no causality.

X => Y can be different from Y => X, unlike correlation.
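The asymmetry between a rule and its reverse can be checked numerically. The sketch below uses support and confidence, the standard measures for association rules (the essay does not name them, so treating them as the measure of interest is an assumption), on a small hypothetical set of transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability that a transaction containing X also contains Y."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread"},
]

print(confidence(transactions, {"bread"}, {"milk"}))  # rule bread => milk
print(confidence(transactions, {"milk"}, {"bread"}))  # rule milk => bread
```

In this toy data, bread appears in four transactions and milk in three, while both appear together in two, so the two directions of the rule have different confidence values (2/4 versus 2/3).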

Association rules assist in marketing, targeted advertising, floor planning, inventory control, churn management, homeland security, etc.


Classification

Classification is a supervised learning method whose goal is to accurately predict the target class for each case in the data. It assigns class labels to a set of unclassified cases under supervision, that is, using a set of patterns whose labels are already known. Data mining classification mechanisms include decision trees, K-Nearest Neighbor (KNN), Bayesian networks, neural networks, fuzzy logic and support vector machines. The classification methods are described as follows:

Decision tree: Decision trees are powerful classification algorithms. Popular decision tree algorithms include Quinlan's ID3, C4.5 and C5, and Breiman et al.'s CART. As the name implies, this technique recursively separates observations into branches to construct a tree for the purpose of improving prediction accuracy. A decision tree classifier divides the attribute space into regions and splits a dataset on the basis of discrete decisions, using thresholds on the attribute values. It is a widely used classification method because it is easy to interpret and can be represented as if-then-else rules.

Most decision tree classifiers perform classification in two phases: tree growing (or building) and tree pruning. Tree building is done in a top-down manner: the data are recursively partitioned until all the data items in a partition belong to the same class label. In the pruning phase, the fully grown tree is cut back in a bottom-up fashion to prevent over-fitting and thereby improve prediction and classification accuracy. Compared to other data mining techniques, decision trees are widely applied in various areas since they are robust to data scales and distributions.
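A single tree-growing step on one numeric attribute can be sketched as follows. The sketch scores each candidate threshold by information gain, the entropy-based criterion used by ID3; the essay does not say which criterion is meant, so that choice, the function names and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        # Weighted entropy of the two partitions produced by the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical attribute values and class labels.
print(best_threshold([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
```

A full tree builder would apply this search to every attribute, split on the best one, and recurse on each partition until the partitions are pure, which is the stopping rule described above.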


K-Nearest Neighbor is one of the best-known distance-based algorithms; in the literature it appears in different versions such as closest point, single link, complete link and K-Most Similar Neighbor. The nearest neighbor algorithm is considered a statistical learning algorithm; it is extremely simple to implement and lends itself to a wide variety of variations. It is among the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors. k is a positive integer, typically small. If k = 1, the object is simply assigned to the class of its nearest neighbor. Nearest neighbor classifiers are based on learning by analogy. The training samples are described by n numeric attributes, so each sample represents a point in an n-dimensional space, and all of the training samples are stored in an n-dimensional pattern space. Nearest neighbor classifiers are instance-based or lazy learners in that they store all of the training samples and do not build a classifier until a new (unlabeled) sample needs to be classified. KNN has a wide variety of applications in fields such as pattern recognition, image databases, Internet marketing and cluster analysis. Table 1 below gives a theoretical comparison of classification techniques.

Nearest-neighbor classifiers [3] find the neighbors of a new instance and then assign to it the label of the majority class among those neighbors.
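The majority-vote rule described above fits in a few lines of Python. This is a minimal sketch, assuming Euclidean distance (the essay does not specify a distance measure) and a hypothetical two-class training set.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label query by majority vote among its k nearest training samples.

    training is a list of (point, label) pairs; Euclidean distance is assumed.
    """
    neighbors = sorted(training, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training samples in a 2-dimensional pattern space.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (7, 7), k=1))
```

Nothing is computed until a query arrives, which is what makes the method a lazy learner: all of the work happens in the sort over the stored samples.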

Probabilistic (Bayesian Network) models:

Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. Bayesian algorithms predict the class according to the probability of belonging to that class. A Bayesian network is a graphical model of the probabilistic relationships among a set of variables (features). It consists of two components. The first is a directed acyclic graph (DAG) in which the nodes represent random variables and the edges represent the probabilistic dependencies among the corresponding variables. The second is a set of parameters that describe the conditional probability of each variable given its parents. The conditional dependencies in the graph are estimated by statistical and computational methods. Bayesian networks thus combine ideas from computer science and statistics.

Probabilistic models calculate probabilities for hypotheses based on Bayes' theorem [3].
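Bayes' theorem itself can be shown with a short calculation. The sketch below computes posterior probabilities for a set of mutually exclusive hypotheses from their priors and the likelihood of one observation; the hypothesis names and all numbers are invented for illustration.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem: P(h | e) = P(e | h) * P(h) / P(e).

    priors maps each hypothesis h to P(h); likelihoods maps h to P(e | h).
    The hypotheses are assumed to be mutually exclusive and exhaustive.
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(e), by the law of total probability
    return {h: joint[h] / evidence for h in joint}


# Hypothetical numbers: a rare condition and an imperfect test.
priors = {"disease": 0.01, "healthy": 0.99}
likelihoods = {"disease": 0.90, "healthy": 0.05}  # P(positive test | h)
result = posterior(priors, likelihoods)
print(max(result, key=result.get), result)
```

A Bayesian classifier predicts the hypothesis with the largest posterior. With these assumed numbers that is "healthy" even after a positive test, because the prior for "disease" is so small.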

Statistical Regression

Regression models are very popular in the biomedical literature and have been applied in virtually every sub-specialty of medical research. Before computers were widely used, linear regression was the most popular way to solve the problem of estimating the intercept and coefficients of a regression equation. It has a solid foundation in statistical theory. In its usual least-squares form, linear regression finds the line that minimizes the sum of squared vertical distances to a set of data points, that is, it finds the equation of the line Y = a + bX. With the help of computers and software packages, far more complex models can now be fitted.
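For the line Y = a + bX, the least-squares estimates have a closed form: b is the covariance of X and Y divided by the variance of X, and a follows from the means. The following is a minimal sketch with hypothetical data; the function name fit_line is an assumption for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates (a, b) for the line Y = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept: the line passes through the means
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```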

Artificial Neural Networks

Artificial neural networks [4] are signal processing systems that try to emulate the behavior of the human brain by providing a mathematical model of numerous neurons connected in a network. A network learns through examples and discriminates among pattern classes by reducing the error and automatically discovering inherent relationships in a data-rich environment. No rules or programmed information are needed beforehand. The network is composed of many interconnected elements called nodes. Each connection between two nodes is weighted, and the network is trained by adjusting these weights. The weights are the network parameters, and their values are obtained from the training procedure. There are usually several layers of nodes. During training, the inputs are presented to the input layer with the desired output values as targets. A comparison mechanism operates between the output and the target value, and the weights are adjusted in order to reduce the error. The procedure is repeated until the network output matches the targets closely enough. Neural networks have many advantages, such as adaptive learning ability, self-organization, real-time operation and insensitivity to noise. However, they also have major disadvantages: they are highly dependent on the training data, and they do not provide an explanation for the decisions they make, so they work like a 'black box'.
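The compare-and-adjust training loop can be illustrated with the simplest possible network, a single neuron trained with the perceptron rule. This is a deliberate simplification of the multi-layer networks discussed above; the learning rate, the epoch limit and the AND-gate data are assumptions chosen for the example.

```python
def predict(weights, bias, inputs):
    """Fire (1) when the weighted sum of the inputs exceeds zero, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, rate=0.1, max_epochs=100):
    """Adjust weights until every training target is reproduced."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            # Compare the output with the target value.
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                # Move each weight in the direction that reduces the error.
                weights = [w + rate * error * x for w, x in zip(weights, inputs)]
                bias += rate * error
        if mistakes == 0:  # the output matches all targets
            break
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print([predict(weights, bias, inputs) for inputs, _ in samples])
```

A single neuron can only learn linearly separable targets such as AND; multi-layer networks lift that restriction and are typically trained with back-propagation rather than this rule.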

Advanced Data Mining Techniques

During the past few years, researchers have tried to combine both unsupervised and supervised methods for analysis [5]. Examples of advanced unsupervised learning models are hierarchical clustering, c-means clustering, self-organizing maps (SOM) and multidimensional scaling techniques. Advanced examples of supervised learning models are classification and regression trees (CART) and support vector machines [6].

Data mining is commonly used in marketing, surveillance, fraud detection, artificial intelligence and scientific discovery, and is now gaining ground in other fields as well.


Bioinformatics and data mining provide challenging and exciting research problems for computation. Bioinformatics is conceptualizing biology in terms of molecules and then applying "informatics" techniques to understand and organize the information associated with these molecules on a large scale. In short, it is a management information system (MIS) for molecular biology. It is the science of managing, mining and interpreting information from biological sequences and structures. Advances such as genome-sequencing initiatives, microarrays, proteomics, and functional and structural genomics have pushed the frontiers of human knowledge. Data mining and machine learning have been advancing with high-impact applications from marketing to science. Although researchers have spent much effort on data mining for bioinformatics, the two areas have largely been developing separately. In classification or regression, the task is to predict the outcome associated with a particular individual given a feature vector describing that individual; in clustering, individuals are grouped together because they share certain properties; and in feature selection, the task is to select those features that are important in predicting the outcome for an individual.

We believe that data mining will provide the necessary tools for a better understanding of gene expression, drug design and other emerging problems in genomics and proteomics. Novel data mining techniques are being proposed for tasks such as:

Gene expression analysis,

Searching and understanding of protein mass spectroscopy data,

3D structural and functional analysis and mining of DNA and protein sequences for structural and functional motifs, drug design, and understanding of the origins of life, and

Text mining for biological knowledge discovery.

Different types of biomedical data:

In "Building Innovative Representations of DNA Sequences to Facilitate Gene Finding," Jianbo Gao, Yinhe Cao, Yan Qi, and Jing Hu propose a way to determine the best discrimination of noncoding and coding regions of genomic DNA sequences. To do this, their approach devises two codon indices based on a new representation of DNA sequences.

In "MicroCluster: Efficient Deterministic Biclustering of Microarray Data," Lizhuang Zhao and Mohammed Zaki present their algorithm for mining gene expression data. MicroCluster first constructs a range multigraph from the microarray data and then searches for constrained maximal cliques to get all qualified biclusters (a bicluster is a set of genes and samples arranged in a matrix). This method can discover arbitrarily positioned and overlapping clusters of genetic data.

In "Finding Protein Domain Boundaries: An Automated, Non-Homology-Based Method," Brian Gurbaxani and Parag Mallick mine protein sequence data to reveal subtle variations of the sequences' amino acid composition.

They have developed a Bayesian algorithm that identifies structural domains in proteins by cataloging the occurrence of groups of amino acids.

In "Choosing the Optimal Hidden Markov Model for Secondary-Structure Prediction," Juliette Martin, Jean-François Gibrat, and François Rodolphe handle the secondary structural data of proteins. Their approach assumes that a model is good if it can achieve the best compromise between the number of parameters and prediction accuracy.

Finally, in "Using Semantic Dependencies to Mine Depressive Symptoms from Consultation Records," Chung-Hsien Wu, Liang-Chih Yu, and Fong-Lin Jang mine depressive symptoms from psychiatric-consultation records (text data). To discover the symptoms, their framework integrates a sentence's semantic dependencies and the strength of the lexical cohesion between sentences. It also uses a domain ontology to mine relations between the extracted symptoms.

In today's world large quantities of data are being accumulated, and seeking knowledge from massive data is one of the most fundamental aims of data mining. It involves not just collecting and managing data but also analysis and prediction. Data can be large in both size and dimension, and there is a huge gap between the stored data and the knowledge that could be derived from it. This is where classification and its sub-mechanisms come in: they place the data into appropriate classes for ease of identification and searching. Classification is thus an indispensable part of data mining and is gaining popularity.

WEKA data mining software

WEKA is data mining software developed by the University of Waikato in New Zealand. It includes several machine learning algorithms for data mining tasks. Since WEKA is implemented in Java, the algorithms can either be applied directly to a dataset or be called from your own Java code. WEKA contains general-purpose tools for data pre-processing, regression, classification, association rules, clustering, feature selection and visualization.

In the bioinformatics arena, the Weka data mining suite has been used for probe selection for gene expression arrays [14], automated protein annotation [7][9], experiments with automatic cancer diagnosis [10], plant genotype discrimination [13], classifying gene expression profiles [11], developing a computational model for frame-shifting sites [8] and extracting rules from them [12]. Most of the algorithms in Weka are described in [15].

WEKA includes algorithms for learning different types of models (e.g. decision trees, rule sets, linear discriminants), feature selection schemes (fast filtering as well as wrapper approaches) and pre-processing methods (e.g. discretization, arbitrary mathematical transformations and combinations of attributes). Weka makes it easy to compare different solution strategies based on the same evaluation method and identify the one that is most appropriate for the problem at hand. It is implemented in Java and runs on almost any computing platform.

The Weka Explorer

The Explorer is the main interface in Weka, shown in figure 1. The "Open file…" option loads data in various formats: ARFF, CSV, C4.5 and Library.

The WEKA Explorer has six (6) tabs, each of which is used to perform a certain task. The tabs are shown in figure 2.

Preprocess: Preprocessing tools in WEKA are called "Filters". The Preprocess tab retrieves data from a file, SQL database or URL (for very large datasets, subsampling may be required since all the data are stored in main memory). Data can be preprocessed using one of Weka's preprocessing tools. The Preprocess tab shows a histogram with statistics of the currently selected attribute. Histograms for all attributes can be viewed simultaneously in a separate window. Some of the filters behave differently depending on whether a class attribute has been set or not. In particular, the supervised filters require a class attribute to be set, and some of the unsupervised attribute filters will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which case no class is set.
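ARFF, the first of the formats listed above, is a plain-text format with a header (@relation and @attribute lines) followed by an @data section of comma-separated rows. The sketch below writes a tiny dataset in that layout; the helper name to_arff and the toy attributes are assumptions, and the sketch ignores quoting, missing values and the other attribute types that the format supports.

```python
def to_arff(relation, attributes, rows):
    """Render a small dataset as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: the allowed values go inside braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
        else:
            lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-attribute dataset.
text = to_arff(
    "toy",
    [("length", "numeric"), ("class", ["yes", "no"])],
    [(1.5, "yes"), (2.0, "no")],
)
print(text)
```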

Classify: The Classify tab is used for further analysis of preprocessed data that pose a classification or regression problem. It provides an interface to learning algorithms for classification and regression models (both are called "classifiers" in Weka), and evaluation tools for analyzing the outcome of the learning process. No matter which evaluation method is used, the model that is output is always the one built from all the training data. WEKA includes all major learning techniques for classification and regression: Bayesian classifiers, decision trees, rule sets, support vector machines, logistic and multi-layer perceptrons, linear regression, and nearest-neighbor methods. It also contains "metalearners" such as bagging, stacking and boosting, as well as schemes that perform automatic parameter tuning using cross-validation, cost-sensitive classification, etc. Learning algorithms can be evaluated using cross-validation or a hold-out set, and Weka provides standard numeric performance measures (e.g. accuracy, root mean squared error), as well as graphical means for visualizing classifier performance (e.g. ROC curves and precision-recall curves). It is possible to visualize the predictions of a classification or regression model, enabling the identification of outliers, and to load and save models that have been generated.
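The idea behind cross-validation is that every instance is used for testing exactly once and for training in all other rounds. The following sketch generates the index splits for k folds; it illustrates the general scheme only and is not WEKA's own procedure (the function name and the round-robin fold assignment are assumptions).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Deal the sample indices out to the folds in round-robin order.
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for held_out, test in enumerate(folds):
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, test


# Ten hypothetical instances split into five folds.
for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

A model would be trained and scored once per split and the scores averaged; in practice the data are usually shuffled first so that folds do not follow the order of the file.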

Cluster: The Cluster tab gives access to Weka's clustering algorithms, such as k-means, a heuristic incremental hierarchical clustering scheme, and mixtures of normal distributions with diagonal covariance matrices estimated using EM. Cluster assignments can be visualized and compared to actual clusters defined by one of the attributes in the data.

Associate: The Associate tab provides algorithms for generating association rules, which can be used to identify relationships between groups of attributes in the data.

Select attributes: Attribute selection involves searching through possible combinations of attributes in the data to find which subset of attributes works best for prediction. This fifth tab is particularly interesting in the context of bioinformatics, because it offers methods for identifying those subsets of attributes that are predictive of another (target) attribute in the data. Weka contains several methods for searching through the space of attribute subsets, as well as evaluation measures for attributes and attribute subsets. Search methods include best-first search, genetic algorithms, forward selection and a simple ranking of attributes. Evaluation measures include correlation- and entropy-based criteria as well as the performance of a selected learning scheme (e.g. a decision tree learner) for a particular subset of attributes. Different search and evaluation methods can be combined, making the system very flexible.
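The simplest combination mentioned above, a ranking of individual attributes by a correlation-based criterion, can be sketched as follows. The sketch scores each attribute by the absolute Pearson correlation with a numeric target; this is an illustrative stand-in rather than a reproduction of any specific WEKA evaluator, and the attribute names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(columns, target):
    """Return attribute names ordered from most to least correlated."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical attributes and target.
columns = {
    "a": [1, 2, 3, 4],    # perfectly correlated with the target
    "b": [1, -1, 1, -1],  # weakly correlated
    "c": [2, 1, 4, 3],    # moderately correlated
}
print(rank_attributes(columns, target=[1, 2, 3, 4]))
```

Ranking scores each attribute in isolation, so it cannot detect attributes that are only useful in combination; subset search methods such as forward selection address that at a higher computational cost.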

Visualize: The Visualize tab shows a matrix of 2D scatter plots for all pairs of attributes in the current relation. Any matrix element can be selected and enlarged in a separate window, where one can zoom in on subsets of the data and retrieve information about individual data points. A "Jitter" option for exposing obscured data points is also provided. Jitter is a random displacement given to all points in the plot. Dragging the jitter slider to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different from a single lonely instance.

Interfaces to Weka

All the learning techniques in Weka can be accessed from the simple command-line interface (CLI), as part of shell scripts, or from within other Java programs using the Weka API. WEKA commands can be executed directly from the CLI.
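The effect of jitter is easy to reproduce. The sketch below displaces every point by a small uniform random offset so that coincident points become distinguishable; it illustrates the idea only and makes no claim about how WEKA computes its jitter (the function name, the uniform distribution and the seed parameter are assumptions).

```python
import random


def jitter(points, amount, seed=None):
    """Shift each (x, y) point by a random offset in [-amount, amount]."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# A thousand hypothetical instances that all sit on the same spot.
stacked = [(1.0, 1.0)] * 1000
spread = jitter(stacked, amount=0.05, seed=42)
print(len(set(stacked)), "distinct position before jitter")
print(len(set(spread)), "distinct positions after jitter")
```

A larger amount spreads the points further, which corresponds to dragging the slider to the right.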

Weka also contains an alternative graphical user interface, called "Knowledge Flow," that can be used instead of the Explorer. Knowledge Flow is a drag-and-drop interface and supports incremental learning. It caters for a more process-oriented view of data mining, where individual learning components (represented by Java beans) can be connected graphically to create a "flow" of information.

Finally, there is a third graphical user interface, the "Experimenter", which is designed for experiments that compare the performance of (multiple) learning schemes on (multiple) datasets. Experiments can be distributed across multiple computers running remote experiment servers, and statistical tests can be conducted between learning schemes.


Classification is one of the most popular techniques in data mining. In this paper we compared algorithms based on their accuracy, learning time and error rate. We observed a direct relationship between the execution time for building the tree model and the volume of data records, and an indirect relationship between the execution time for building the model and the attribute size of the data sets. From our experiment we conclude that Bayesian algorithms have good classification accuracy compared with the other algorithms considered. To keep bioinformatics a lively research area, it should be broadened to include new techniques.