The extraction of useful information from huge datasets provides knowledge in decision making process and Knowledge Discovery makes use of Artificial Intelligence (AI) algorithms such as 'k-means clustering', 'Decision Trees (ID3)', 'Neural Network (NNs)' and 'Data Visualization (2D or 3D scattered graphs)', which will enable prediction, classification, interpretation and visualization on a dataset. These tasks are crucial and play pivotal roles to extract useful information and knowledge from datasets. It is clear that no single data mining algorithm is capable to perform all these tasks; therefore, a unified model of these algorithms is proposed, called Unified Model of Data Mining (UMDM) Algorithms, which accomplishes all the described tasks.
2. Literature Review
The purpose of data mining is to verify the hypothesis prepared by the user and to discover or uncover new patterns from the large datasets. The data mining techniques help to find the relationships between multiple parental variables and the outcomes they influence. Data mining algorithms are used in different fields of study like bioinformatics, genetics, medicine and education due to their robustness, scalability and efficiency. The classification, clustering, regression model and association rule learning are the main areas of data mining algorithms . Table 1 shows the capabilities and tasks that the different data mining algorithms can perform.
Get your grade
or your money back
using our Essay Writing Service!
Table 1: Functions of Different Data Mining Algorithms
The current data mining algorithms and techniques are designed for individual problems, such as classification or clustering, but there is no unifying theory. A theoretical and practical framework is designed that unifies different data mining tasks including clustering, classification, prediction, visualization and association rules using multiagent system in distributed environment. The choice of the algorithm depends on the intended use of extracted knowledge. The data can be used either to predict future behavior or to describe patterns in an understandable form within discover process.
Top 10 problems in the field of data mining are:
• Developing a unifying theory of data mining
• Scaling up for high dimensional data and high speed data streams
• Mining sequence data and time series data
• Mining complex knowledge from complex data
• Data mining in a network setting
• Distributed data mining and mining multi-agent data
• Data mining for biological and environmental problems
• Data mining process-related problems
• Security, privacy and data integrity
• Dealing with non-static, unbalanced and cost-sensitive data 
The following data mining algorithms are used in UMDM for clustering, classification, prediction, visualization and association rules.
2.1 Neural Networks
The neural networks are used for discovering complex or unknown relationships in dataset. They detect patterns from the large datasets for prediction or classification, also used in system performing image and signal processing, pattern recognition, robotics, automatic navigation, prediction and forecasting and simulations. The NNs are more effective and efficient on small to medium sized datasets. The data must be trained first by NNs and the process it goes through is considered to be hidden and therefore left unexplained . The neural network starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer. Each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables. Figure 2 illustrates the different layers .
Always on Time
Marked to Standard
Figure 2: A Neural Network with one hidden layer
Limitations of NNs
The process it goes through is considered by most to be hidden and therefore left unexplained. This lack of explicitness may lead to less confidence in the results and a lack of willingness to apply those results from data mining, since there is no understanging of how the results came about. It is abovious this as the data sets variable increase in size, that it will become more difficult to understand how it come to its conclusion. The algorithm is better suited to learning on small to medium sized datasets as it becomes too time inefficient on large sized datasets .
2.2 Decision Tree (C4.5) Algorithm
The decision tree algorithm is used as an efficient method for producing classifiers from data. The goal of supervised learning is to create a classification model, known as a classifier, which will predict, with the values of its available input attributes, the class for some entity. In other words, classification is the process of dividing the samples into pre-defined groups. It is used for decision rules as an output. In order to do mining with the decision trees, the attributes have continuous discrete values, the target attribute values must be provided in advance and the data must be sufficient so that the prediction of the results will be possible. Decision trees are faster to use, easier to generate understanding rules and simpler to explain since any decision that is made can be understood by viewing path of decision. They also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in the form of if-then-else, which can be used for the decision support systems, classification and prediction. Figure 3 illustrates how decision rules are obtained from decision tree algorithm.
Figure 3: Decision Rules from a C4.5 Algorithm
The different steps of decision tree (C4.5) algorithm are given below:
Step 1: Let 'S' is a training set. If all instances in 'S' are positive, then create 'YES' node and halt. If all instances in 'S' are negative, create a 'NO' node and halt. Otherwise select a feature 'F' with values v1,...,vn and create a decision node.
Step 2: Partition the training instances in 'S' into subsets S1, S2, ..., Sn according to the values of V.
Step 3: Apply the algorithm recursively to each of the sets Si.
The decision tree algorithm generates understandable rules, performs classification without requiring much computation, suitable to handle both continuous and categorical variables and provides an indication for prediction or classification .
Limitation of C4.5
It is good for small problems but quickly becomes cumbersome and hard to read for intermediate-sized problems. Special software is required to draw that tree. If there is a noise in the learning set, it will fail to find a tree. The data must be interval or categorical. Any data not in this format will have to be recorded to this format. This process could hide relationships. Over fitting, large set of possible hypotheses, pruning of the tree is required. C4.5 generally represents a finite number of classes or possibilities. It is difficult for decision makers to quantify a finite amount of variables. This sometimes affects the accuracy of the output, hence misleading answer. If the list of variables increases the if-then statements created can become more complex. This method is not useful for all types of data mining, such as time series .
2.3 k-means Clustering Algorithm
The 'k', in the k-means algorithm stands for number of clusters as an input and the 'means' stands for an average, location of all the members of a particular cluster. The algorithm is used for finding the similar patterns due to its simplicity and fast execution. This algorithm uses a square-error criterion in equation 1 for re-assignment of any sample from one cluster to another, which will cause a decrease in the total squared error.
This Essay is
a Student's Work
This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.Examples of our work
Where (F - C)2 is the distance between the datapoints. It is easy to implement, and its time and space complexity are relatively small. Figure 4 illustrates the working of clustering algorithms .
Clusters of Dataset
Figure 4: The Function of the Clustering Algorithms
The different steps of k-means clustering algorithm are given below:
Step 1: Select the value of 'k', the number of clusters.
Step 2: Calculate the initial centroids from the actual sample of dataset. Divide datapoints into 'k' clusters.
Step 3: Move datapoints into clusters using Euclidean's distance formula in equation 2. Recalculate new centroids. These centroids are calculated on the basis of average or means.
Step 4: Repeat step 3 until no datapoint is to be moved.
Where d(xi, xj) is the distance between xi and xj. xi and xj are the attributes of a given object, where i, j and k vary from 1 to N where N is total number of attributes of that given object, indexes i, j, k and N are all integers . The k-means clustering algorithm is applied in number of areas like, Marketing, Libraries, Insurance, City-planning, Earthquake studies, www and Medical Sciences .
Limitations of k-means clustering algorithm
The algorithm is only applicable to datasets where the notion of the mean is defined. Thus, it is difficult to apply to categorical data sets. There is, however, a variation of the k-means algorithm called k-modes, which clusters categorical data. The algorithm uses the mode instead of the mean as the centroid. The user needs to specify the number of clusters k in advance. In practice, several k values are tried and the one that gives the most desirable result is selected. The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. Outliers could be errors in the data recording or some special data points with very different values. The algorithm is sensitive to initial seeds, which are the initially selected centroids. Different initial seeds may result in different clusters. Thus, if the sum of squared error is used as the stopping criterion, the algorithm only achieves local optimal. The global optimal is computationally infeasible for large data sets. The algorithm is not suitable for discovering clusters that are not hyper-ellipsoids or hyper-spheres .
2.4 Data Visualization
This method provides the better understanding of data to the users. Graphics and visualization tools better illustrate the relationship among data and their importance in data analysis cannot be overemphasized. The distributions of values can be displayed by using histograms or box plots. 2D or 3D scattered graphs can also be used. Visualization works because it provides the broader information as opposed to text or numbers. The missing and exceptional values from data, the relationships and patterns within the data are easier to identify when graphically displayed. It allows the user to easily focus and see the patterns and trends amongst data .
Limitations of Data Visualization
One major issue in data visualization is the fact that as the volume of the data increases it becomes difficult to distinguish patterns from datasets, another major issue is that the display format from visualization is restricted to two dimensions by the display device be it a computer screen or a paper .
3. Aims and Objectives
The current data mining algorithms and techniques are designed for individual problems, such as classification or clustering, but there is no unifying theory. A theoretical and practical framework is designed that unifies different data mining tasks including clustering, classification, prediction, visualization and association rules using multiagent system in distributed environment. The proposed UMDM will be tested on medical datasets.
4. Questions and Problem
The following are sample questions that may be asked to a specialist medical doctor:
What type of prediction, classification, interpretation and visualization is required in the medical databases particularly diabetes?
Which attribute or the combinations of the attributes of diabetes dataset have the impact to predict diabetes in the patient?
What are the future requirements for prediction of disease like diabetes?
Relationship between the attributes which will provide some hidden pattern in the dataset.
The pattern discovery from large dataset is a three steps process. In first step, one seeks to enumerate all of the associations that occur at least 'a' times in the dataset. In the second step, the clusters of the dataset are created and the third and last step is to construct the 'decision rules' with (if-then statements) the valid pattern pairs. Association Analysis: Association mining is concerned with whether the co-joint event (A,B,C,….) occurs more or less than would be expected on a chance basis. If it occurs as much (within a pre-specified margin), then it is not considered an interesting rule. Predictive Analysis: It is to generate 'decision rules' from the dataset using logical operations. The result of these rules will be either 'true' or 'false' .
The methods and applications of data mining are based on computational intelligence such as artificial neural network, k-means clustering, decision trees and data visualization . Many classifiers have been introduced for prediction, including Logistic Regression, Naïve Bayes, Decision Tree, K-local hyper plane distance nearest neighbour classifiers, Random Decision Forest, Support Vector Machine (SVM) etc . Among the different algorithms in data mining for prediction, classification, interpretation and visualization, 'k-means clustering', 'decision trees', neural networks' and 'data visualization (2D or 3D scattered graphs)' algorithms are commonly adopted data mining tools. In medical sciences, the classification of medicines, patient records according to their doses etc. can be performed by applying the clustering algorithms. The issue is how to interpret these clusters. To do so visualization tools are indispensable. Taking this aspect into account a Unified Model of Data Mining (UMDM) algorithms is proposed, which will unify these data mining algorithms into a single black box so that the user needs to provide the dataset and recommendations from specialist doctor as the input. Figure 1 depicts the inputs and outputs of UMDM.
Medical Data Miner
A Medical Dataset
Figure 1: A Unified Model of Data Mining
A multiagent system (MAS) is used in this proposed UMDM, which is capable of performing classification and interpretation of large datasets. In this MAS 'k-means clustering' algorithm is used for classification and 'decision tree C4.5' is interpretation. For prediction and visualization, neural networks and 2D scattered graphs are used. This multiagent system is cascaded i.e. the output of an agent is an input for the other agents .
A medical dataset 'Diabetes' is chosen for UMDM and first k-means clustering algorithm will be applied on this dataset. The interval scaled data is properly cleansed by applying the range method. The attributes of the dataset/testbed 'Diabetes' are: Number of Times Pregnant (NTP)(min. age = 21, max. age = 81), Plasma Glucose Concentration a 2 hours in an oral glucose tolerance test (PGC), Diastolic Blood Pressure (mm Hg) (DBP), Triceps Skin Fold Thickness (mm) (TSFT), 2-Hour Serum Insulin (m U/ml) (2HSHI), Body Mass Index (weight in kg/(height in m)^2) (BMI), Diabetes Pedigree Function (DPF), Age, Class (whether diabetes is cat 1 or cat 2) .
There are two main sources of data distribution, first is the centralized data source and second is the distributed data source. The distributed data source has further two approaches of data partitioning, first, the horizontally partitioned data, where same sets of attributes are on each node, this case is also called the homogeneous case. The second is the vertically partitioned data, which requires that different attributes are observed at different nodes, this case is also called the heterogeneous case. It is required that each node must contain a unique identifier to facilitate matching in vertical partition .
The vertical partitioning of dataset 'Diabetes' will be created on the basis of attributes values. The attribute 'class' is a unique identifier in all these partitions. This is represented in tables from 2 to 5.
Table 2: Vertically distributed Diabetes dataset at node 1
Table 3: Vertically distributed Diabetes dataset at node 2
Table 4: Vertically distributed Diabetes dataset at node 3
Table 5: Vertically distributed Diabetes dataset at node 4
The k-means clustering algorithm will be applied on the above created vertical partitions. The value of 'k', number of clusters is set to 4 and the number of iterations 'n' in each case is 50 i.e. value of k =4 and value of n=50. The decision rules for these obtained clusters will be created by using decision tree (ID3) algorithm. For the further interpretation and visualization of the results of these clusters, 2D scattered graphs are drawn using data visualization.
Developed testbeds are essential for progress in data mining. The requirements for data mining testbeds are different than those for general purpose high performance computing testbeds. For example, the computing resources for data mining testbeds are as much disk-oriented as processor-oriented; the network resources must be able move datasets and data elements between geographically distributed sites with guaranteed quality of service .
The research will be completed within next 15 months.