# Data Mining Algorithms Used in the Medical Data Miner


Medical datasets hold a huge number of records about patients, doctors and diseases. Extracting useful information that supports decision making in the diagnosis and treatment of diseases is becoming increasingly important. Knowledge discovery makes use of Artificial Intelligence (AI) algorithms such as 'k-means clustering', 'Decision Trees (ID3)', 'Neural Networks (NNs)' and 'Data Visualization (2D or 3D scattered graphs)'. In this paper, these algorithms are unified into a tool, called Medical Data Miner, that enables prediction, classification, interpretation and visualization on a diabetes dataset.

Keywords: Medical Data Miner, Data Mining, Multiagent System, Knowledge Discovery

## 1. Introduction

Vast amounts of data are generated through health care processes, and clinical datasets are among the most significant of these. Data mining techniques help to find the relationships between multiple parental variables and the outcomes they influence. The methods and applications of medical data mining are based on computational intelligence such as artificial neural networks, k-means clustering, decision trees and data visualization (Irene M. Mullins et al 2006, Gupta, Anamika et al 2005, Zhu, L. et al 2003, Padmaja, P. et al 2008, Yue, Huang et al 2004). The purpose of data mining is to verify hypotheses prepared by the user and to discover new patterns in large datasets.

Many classifiers have been introduced for prediction, including Logistic Regression, Naïve Bayes, Decision Tree, K-local hyper plane distance nearest neighbour classifiers, Random Decision Forest, Support Vector Machine (SVM) etc. (Dong, Q.W., Zhou, S.G., and Liu, X 2010 and IIango, B. Sarojini, and Ramaraj, N. 2010). Among the different data mining algorithms for prediction, classification, interpretation and visualization, 'k-means clustering', 'decision trees', 'neural networks' and 'data visualization (2D or 3D scattered graphs)' are commonly adopted. In medical sciences, the classification of medicines, of patient records according to their doses and so on can be performed by applying clustering algorithms. The issue is how to interpret these clusters, and for this visualization tools are indispensable. Taking this into account, we propose a Medical Data Miner that unifies these data mining algorithms into a single black box, so that the user only needs to provide the dataset and the recommendations of a specialist doctor as input. Figure 1 depicts the inputs and outputs of the Medical Data Miner.

## Figure 1: A Medical Data Miner (inputs: a medical dataset and the doctor's proposals; outputs: prediction, classification, interpretation and visualization)

The following are sample questions that may be asked to a specialist medical doctor:

What type of prediction, classification, interpretation and visualization is required in medical databases, particularly for diabetes?

Which attributes, or combinations of attributes, of the diabetes dataset help to predict diabetes in a patient?

What are the future requirements for the prediction of a disease like diabetes?

Which relationships between the attributes reveal hidden patterns in the dataset?

A multiagent system (MAS) is used in the proposed Medical Data Miner, which is capable of performing classification, interpretation and visualization of large datasets. In this MAS, the k-means clustering algorithm is used for classification, ID3 for interpretation, NNs for prediction and 2D scattered graphs for visualization. Moreover, this multiagent system is cascaded, i.e. the output of one agent is the input of the next (Voskob, Max, and Howey, Rob 2003).

Section 2 presents an overview of the data mining algorithms used in the Medical Data Miner, section 3 describes the methodology, section 4 discusses the results obtained and section 5 presents the conclusion.

## 2. Overview of Data Mining Algorithms used in the Medical Data Miner

Data mining algorithms are widely accepted nowadays because of their robustness, scalability and efficiency in fields of study such as bioinformatics, genetics, medicine and education. Classification, clustering, interpretation and data visualization are the main areas of data mining algorithms (Skrypnik, Irina et al 1999, and Peng, Y., Kou, G., Shi, Y., and Chen, Z 2008). Table 1 shows the capabilities and tasks that the different data mining algorithms can perform.

| DM Algos. | Estimation | Interpretation | Prediction | Classification | Visualization |
|---|---|---|---|---|---|
| Neural Network | Y | N | Y | N | N |
| Decision Tree | N | Y | Y | Y | N |
| K-Means | Y | N | Y | Y | N |
| Kohonen Map | Y | N | Y | Y | N |
| Data Visualization | N | Y | Y | Y | Y |
| K-NN | Y | N | Y | Y | N |
| Link Analysis | Y | N | Y | N | N |
| Regression | Y | N | Y | N | N |
| Bayesian Classification | Y | N | Y | Y | N |
| Overall Decision | All | Only 2 | All | Only 6 | Only 1 |

## Table 1: Functions of Different Data Mining Algorithms

## 2.1. Neural Networks

Neural networks are used for discovering complex or unknown relationships in a dataset. They detect patterns in large datasets for prediction or classification, and are also used in systems performing image and signal processing, pattern recognition, robotics, automatic navigation, prediction and forecasting, and simulations. NNs are more effective and efficient on small to medium sized datasets. The network must first be trained on data, and the process it goes through is considered to be hidden and is therefore left unexplained. The neural network starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer, and each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables. Figure 2 illustrates the different layers (Liu, Bing 2007, and Two Crows 1999).

## Figure 2: A Neural Network with one hidden layer (layers: inputs, hidden layer, output; trained and unknown data enter as inputs, useful data is the output)
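The layered structure described above can be sketched as a single forward pass. This is only an illustration, not the network used in this paper: the sigmoid activation, the layer sizes (8 inputs for the eight predictor attributes, 4 hidden nodes, 1 output node) and the random, untrained weights are all assumptions.

```python
import math
import random


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1). The activation is an assumption."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every node of the layer.

    Each row of `weights` holds the incoming weights of one node.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, out_w, out_b)


# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
n_inputs, n_hidden, n_outputs = 8, 4, 1
hidden_w = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_b = [0.0] * n_hidden
out_w = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_outputs)]
out_b = [0.0] * n_outputs

# One standardized record (all attributes scaled to 0.5) passed through the network.
prediction = forward([0.5] * n_inputs, hidden_w, hidden_b, out_w, out_b)
```

Training, which adjusts the weights so that the output matches the known class, is omitted here.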

## 2.2. Decision Tree Algorithm

The decision tree algorithm is an efficient method for producing classifiers from data. The goal of supervised learning is to create a classification model, known as a classifier, which predicts the class of an entity from the values of its available input attributes. In other words, classification is the process of dividing the samples into pre-defined groups. The algorithm produces decision rules as its output. To mine with decision trees, the attributes must have continuous or discrete values, the target attribute values must be provided in advance, and the data must be sufficient for prediction of the results to be possible. Decision trees are fast to use, generate easily understood rules and are simple to explain, since any decision that is made can be understood by viewing the path of the decision. They also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in if-then-else form and can be used for decision support systems, classification and prediction. Figure 3 illustrates how decision rules are obtained from the decision tree algorithm.

## Figure 3: Decision Rules from a Decision Tree Algorithm (data → ID3 algorithm → decision rules)

The steps of the decision tree (ID3) algorithm are given below:

Step 1: Let 'S' be a training set. If all instances in 'S' are positive, then create a 'YES' node and halt. If all instances in 'S' are negative, create a 'NO' node and halt. Otherwise select a feature 'F' with values v1,...,vn and create a decision node.

Step 2: Partition the training instances in 'S' into subsets S1, S2, ..., Sn according to the values v1,...,vn of 'F'.

Step 3: Apply the algorithm recursively to each of the sets Si.
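The three steps above can be sketched as a short recursive function. The steps do not say how the feature 'F' is chosen; the sketch assumes the usual ID3 criterion of highest information gain. The toy records and the attribute names `glucose` and `bmi` are hypothetical and are not taken from the 'Diabetes' dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on `feature`."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, features):
    # Step 1: a pure set becomes a leaf ('YES' or 'NO').
    if len(set(labels)) == 1:
        return labels[0]
    # No feature left to split on: fall back to the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Step 1 (continued): select a feature F; information gain is an assumption.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    # Step 2: partition S into S1..Sn according to the values of F.
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        # Step 3: apply the algorithm recursively to each subset Si.
        tree[best][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return tree


def classify(tree, row):
    """Follow the decision path for one record until a leaf is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][row[feature]]
    return tree


# Hypothetical, already discretized toy records.
rows = [
    {"glucose": "high", "bmi": "high"},
    {"glucose": "high", "bmi": "normal"},
    {"glucose": "normal", "bmi": "high"},
    {"glucose": "normal", "bmi": "normal"},
]
labels = ["YES", "YES", "NO", "NO"]
tree = id3(rows, labels, ["glucose", "bmi"])
```

Each path from the root to a leaf of `tree` reads directly as one if-then rule, which is what makes the output easy to interpret.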

The decision tree algorithm generates understandable rules, performs classification without requiring much computation, is suitable for handling both continuous and categorical variables, and provides an indication for prediction or classification (MacQueen, J.B. 1967, Liu, Bing 2007 and Two Crows 1999).

## 2.3 k-means Clustering Algorithm

The 'k' in the k-means algorithm stands for the number of clusters, given as an input, and 'means' stands for the average location of all the members of a particular cluster. The algorithm is used for finding similar patterns because of its simplicity and fast execution. It uses the square-error criterion in equation 1: a sample is re-assigned from one cluster to another when doing so decreases the total squared error.

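$$E = \sum_{j=1}^{k} \sum_{F \in S_j} (F - C_j)^2 \qquad (1)$$

Here $S_j$ is the j-th cluster, $C_j$ is its centroid and $(F - C_j)^2$ is the squared distance between a datapoint $F$ and the centroid of its cluster. The algorithm is easy to implement, and its time and space complexity are relatively small. Figure 4 illustrates the working of clustering algorithms.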

## Figure 4: The Function of the Clustering Algorithms (dataset → k-means algorithm → clusters of dataset)

The steps of the k-means clustering algorithm are given below:

Step 1: Select the value of 'k', the number of clusters.

Step 2: Calculate the initial centroids from the actual sample of dataset. Divide datapoints into 'k' clusters.

Step 3: Move each datapoint into the cluster with the nearest centroid, using the Euclidean distance formula in equation 2. Recalculate the new centroids as the average (mean) of the datapoints in each cluster.

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{N} (x_{ik} - x_{jk})^2} \qquad (2)$$

Step 4: Repeat step 3 until no datapoint is to be moved.

In equation 2, d(xi, xj) is the distance between the datapoints xi and xj, and the index k runs over their attributes from 1 to N, where N is the total number of attributes of an object; the indexes i, j, k and N are all integers (Davidson, Ian 2002, Liu, Bing 2007 and Two Crows 1999). The k-means clustering algorithm is applied in a number of areas such as marketing, libraries, insurance, city planning, earthquake studies, the World Wide Web and medical sciences (Peng, Y., Kou, G., Shi, Y., Chen, Z. 2008).
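The four steps can be sketched as follows. This is a minimal batch version in which all datapoints are re-assigned before the centroids are recalculated, rather than one sample at a time; the random choice of initial centroids, the fixed seed and the six toy datapoints are assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two datapoints with the same attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Attribute-wise average of a non-empty list of datapoints."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(points, k, max_iter=50, seed=0):
    rng = random.Random(seed)
    # Step 2: take the initial centroids from the actual sample of the dataset.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: move every datapoint to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        # Step 4: stop as soon as no datapoint changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3 (continued): recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return centroids, assignment


def squared_error(points, centroids, assignment):
    """Total squared distance of the datapoints to their cluster centroids."""
    return sum(euclidean(p, centroids[a]) ** 2 for p, a in zip(points, assignment))


# Hypothetical 2-attribute datapoints forming two well separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignment = k_means(points, k=2)
total_error = squared_error(points, centroids, assignment)
```

The `max_iter` cap bounds the number of passes when the assignments keep changing; the value of 'k' has to be chosen by the user (Step 1).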

## 2.4. Data Visualization

This method gives users a better understanding of the data. Graphics and visualization tools better illustrate the relationships among data, and their importance in data analysis cannot be overemphasized. The distributions of values can be displayed by using histograms or box plots; 2D or 3D scattered graphs can also be used. Visualization works because it conveys broader information than text or numbers. Missing and exceptional values, and the relationships and patterns within the data, are easier to identify when displayed graphically, which allows the user to focus on the patterns and trends in the data. One major issue in data visualization is that, as the volume of data increases, it becomes difficult to distinguish patterns in the dataset. Another is that the display format is restricted to two dimensions by the display device, be it a computer screen or paper (Two Crows 1999).

## 3. Methodology

We first apply the k-means clustering algorithm to a medical dataset, 'Diabetes'. This is a dataset/testbed of 790 records. Before the k-means clustering algorithm is applied, the data is pre-processed, a step called data standardization: the interval-scaled data is cleansed by applying the range method. The attributes of the dataset/testbed 'Diabetes' are: Number of Times Pregnant (NTP), Plasma Glucose Concentration at 2 hours in an oral glucose tolerance test (PGC), Diastolic Blood Pressure (mm Hg) (DBP), Triceps Skin Fold Thickness (mm) (TSFT), 2-Hour Serum Insulin (μU/ml) (2HIS), Body Mass Index (weight in kg/(height in m)^2) (BMI), Diabetes Pedigree Function (DPF), Age (min. age = 21, max. age = 81), Class (whether diabetes is cat 1 or cat 2) (web site of National Institute of Diabetes and Digestive and Kidney Diseases, Pima Indians Diabetes Dataset 2010).
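The text does not spell out the range method. A common reading, assumed here, is min-max scaling, where each value is shifted by the attribute's minimum and divided by its range so that every attribute lies between 0 and 1. The sample ages below combine the stated minimum and maximum with two example values.

```python
def range_standardize(values):
    """Scale interval-scaled values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [21, 33, 50, 81]
scaled_ages = range_standardize(ages)
```

Scaling of this kind keeps attributes with large numeric ranges from dominating the Euclidean distances used by k-means.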

There are two main kinds of data source: the centralized data source and the distributed data source. A distributed data source can be partitioned in two ways. In horizontally partitioned data, the same set of attributes is held on each node; this is also called the homogeneous case. In vertically partitioned data, different attributes are observed at different nodes; this is also called the heterogeneous case. Each node must contain a unique identifier to facilitate matching in a vertical partition (Irene M. Mullins et al 2006, and Skrypnik, Irina 1999).

In this paper we use vertical partitioning of the 'Diabetes' dataset. We create the vertical partitions of the dataset on the basis of attribute values. The attribute 'class' is used as the unique identifier in all these partitions. The partitions are shown in tables 2 to 5.

| NTP | DPF | Class |
|---|---|---|
| 4 | 0.627 | -ive |
| 2 | 0.351 | +ive |
| 2 | 2.288 | -ive |

## Table 2: Vertically distributed Diabetes dataset at node 1

| DBP | AGE | Class |
|---|---|---|
| 72 | 50 | -ive |
| 66 | 31 | +ive |
| 64 | 33 | -ive |

## Table 3: Vertically distributed Diabetes dataset at node 2

| TSFT | BMI | Class |
|---|---|---|
| 35 | 33.6 | -ive |
| 29 | 28.1 | +ive |
| 0 | 43.1 | -ive |

## Table 4: Vertically distributed Diabetes dataset at node 3

| PGC | 2HIS | Class |
|---|---|---|
| 148 | 0 | -ive |
| 85 | 94 | +ive |
| 185 | 168 | -ive |

## Table 5: Vertically distributed Diabetes dataset at node 4

Each partitioned table is a dataset of 790 records; only three example records are shown in each table.

We first apply the k-means clustering algorithm to the vertical partitions created above. The number of clusters 'k' is set to 4 and the number of iterations 'n' is 50 in each case, i.e. k = 4 and n = 50. The decision rules for the clusters obtained are then created using the decision tree (ID3) algorithm. For further interpretation and visualization of these clusters, 2D scattered graphs are drawn using data visualization.
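The partitioning shown in tables 2 to 5 can be sketched as a projection of each record onto the attributes held by a node, with 'Class' copied into every partition. The sketch assumes that the rows of the four tables line up, so that the first row of each table belongs to the same record; the node names are hypothetical.

```python
# Attributes held at each node, following tables 2 to 5.
NODE_ATTRIBUTES = {
    "node1": ["NTP", "DPF"],
    "node2": ["DBP", "AGE"],
    "node3": ["TSFT", "BMI"],
    "node4": ["PGC", "2HIS"],
}


def vertical_partition(records, node_attributes, key="Class"):
    """Split full records into per-node tables that all share the `key` column."""
    return {
        node: [{**{a: r[a] for a in attrs}, key: r[key]} for r in records]
        for node, attrs in node_attributes.items()
    }


# The three example records, reassembled from the tables (row alignment assumed).
records = [
    {"NTP": 4, "DPF": 0.627, "DBP": 72, "AGE": 50, "TSFT": 35, "BMI": 33.6,
     "PGC": 148, "2HIS": 0, "Class": "-ive"},
    {"NTP": 2, "DPF": 0.351, "DBP": 66, "AGE": 31, "TSFT": 29, "BMI": 28.1,
     "PGC": 85, "2HIS": 94, "Class": "+ive"},
    {"NTP": 2, "DPF": 2.288, "DBP": 64, "AGE": 33, "TSFT": 0, "BMI": 43.1,
     "PGC": 185, "2HIS": 168, "Class": "-ive"},
]
partitions = vertical_partition(records, NODE_ATTRIBUTES)
```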

## 4. Results and Discussion

Pattern discovery from a large dataset is a three-step process. In the first step, one seeks to enumerate all of the associations that occur at least 'a' times in the dataset. In the second step, the clusters of the dataset are created. The third and last step is to construct the 'decision rules' (if-then statements) from the valid pattern pairs. Association analysis: association mining is concerned with whether the co-joint event (A, B, C, …) occurs more or less often than would be expected on a chance basis. If it occurs about as often as expected (within a pre-specified margin), then it is not considered an interesting rule. Predictive analysis: this generates 'decision rules' from the diabetes medical dataset using logical operations. The result of applying these rules to a 'patient record' will be either 'true' or 'false' (Zheng, F. et al 2010).

The four partitioned datasets of the medical dataset 'Diabetes' are input to our proposed MDM one by one. In total sixteen clusters are obtained, four for each node. The 2D scattered graphs of the interesting clusters are shown in figures 5, 6, 7 and 8.
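The first step, enumerating associations that occur at least 'a' times, can be illustrated by counting co-occurring attribute-value pairs. The sketch is limited to pairs, and the discretized toy records are hypothetical; it does not reproduce the procedure actually used by the MDM.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(records, min_count):
    """Return every pair of (attribute, value) items seen together
    in at least `min_count` records, with its count."""
    counts = Counter()
    for record in records:
        items = sorted(record.items())  # fixed order so each pair has one key
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical discretized records.
toy_records = [
    {"PGC": "high", "BMI": "high", "Class": "Cat2"},
    {"PGC": "high", "BMI": "normal", "Class": "Cat2"},
    {"PGC": "normal", "BMI": "normal", "Class": "Cat1"},
]
associations = frequent_pairs(toy_records, min_count=2)
```

With `min_count=2`, the only surviving association in this toy data is the pairing of a high 'PGC' with class 'Cat2'.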

## Figure 5: A Scattered Graph for cluster 1 of node 4 between PGC and 2HIS attributes of Diabetes dataset

The graph in figure 5 shows that the distance between the attributes 'PGC' and '2HIS' varies from beginning to end. This suggests that the 'class' attribute 'category' of the diabetes dataset does not depend on these two attributes, i.e. if one attribute indicates category 1, the other indicates category 2 in the patient.

## Figure 6: A Scattered Graph for cluster 3 of node 1 between NTP and DPF attributes of Diabetes dataset

The graph in figure 6 shows that the distance between the attributes 'NTP' and 'DPF' is constant at the beginning, then varies, and becomes constant again at the end. This graph has two regions, one from 0 to 12 and the second from 13 to 30.

## Figure 7: A Scattered Graph for cluster 4 of node 4 between PGC and 2HIS attributes of Diabetes dataset

The graph in figure 7 shows that the distance between 'PGC' and '2HIS' varies from beginning to end. The structure of this graph is similar to that of the graph in figure 5. In this graph the 'class' attribute 'category' does not depend on both of these attributes: if attribute 'PGC' indicates category 1 diabetes in a patient, then attribute '2HIS' indicates category 2.

## Figure 8: A Scattered Graph for cluster 4 of node 3 between TSFT and BMI attributes of Diabetes dataset

The graph in figure 8 shows that the distance between the attributes 'TSFT' and 'BMI' is variable almost throughout, but in some regions of the graph the distance between these two attributes is constant, which suggests that the 'class' attribute 'category' depends upon both 'TSFT' and 'BMI'.

In total, sixteen sets of decision rules are generated by the proposed MDM, one for each cluster. Only two interesting sets of decision rules, used for the interpretation of clusters, are given below:

The Decision Rules of cluster 1 of node 4 are:

Rule: 1 if PGC = "165" then

Class = "Cat2"

else

Rule: 2 if PGC = "153" then

Class = "Cat2"

else

Rule: 3 if PGC = "157" then

Class = "Cat2"

else

Rule: 4 if PGC = "139" then

Class = "Cat2"

else

Rule: 5 if HIS = "545" then

Class = "Cat2"

else

Rule: 6 if HIS = "744" then

Class = "Cat2"

else

Class = "Cat1"

## Figure 9: Decision Rules of cluster 1 of node 4

There are six decision rules for cluster 1 of node 4. The result for this cluster of the 'Diabetes' dataset is that if the value of attribute 'PGC' is above 120 and the value of attribute 'HIS' is above 500, then the patient has diabetes of category 2, otherwise category 1. The decision rules make it easy and simple for the user to interpret and predict this partitioned dataset of diabetes.
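The six rules in figure 9 form a single if-then-else chain, so they can be written as one small function. In this sketch the values are compared as numbers rather than as the quoted strings shown in the rules, and the function and argument names are hypothetical.

```python
def classify_cluster1_node4(pgc, his):
    """Apply the rule chain of figure 9 to one record of node 4."""
    # Rules 1 to 4: specific 'PGC' values lead to category 2.
    if pgc in (165, 153, 157, 139):
        return "Cat2"
    # Rules 5 and 6: specific 'HIS' values lead to category 2.
    if his in (545, 744):
        return "Cat2"
    # Final else branch of the chain.
    return "Cat1"
```

For example, `classify_cluster1_node4(165, 0)` follows Rule 1 and returns "Cat2", while a record matching none of the listed values falls through to "Cat1".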

The Decision Rules of cluster 3 of node 1 are:

Rule: 1 if DPF = "1.32" then

Class = "Cat1"

else

Rule: 2 if DPF = "2.29" then

Class = "Cat1"

else

Rule: 3 if NTP = "2" then

Class = "Cat2"

else

Rule: 4 if DPF = "2.42" then

Class = "Cat1"

else

Rule: 5 if DPF = "2.14" then

Class = "Cat1"

else

Rule: 6 if DPF = "1.39" then

Class = "Cat1"

else

Rule: 7 if DPF = "1.29" then

Class = "Cat1"

else

Rule: 8 if DPF = "1.26" then

Class = "Cat1"

## Figure 10: Decision Rules of cluster 3 of node 1

There are eight decision rules for cluster 3 of node 1. The result for this cluster of the 'Diabetes' dataset is that if the value of the attribute 'DPF' is 1.2 then the patient has diabetes of category 1, and if the value of attribute 'NTP' is 2 then the patient has diabetes of category 2. The decision rules make it easy and simple for the user to interpret and predict this partitioned dataset of diabetes.

The importance of the attributes of the 'Diabetes' dataset is shown in figures 11, 12 and 13.

## Figure 11: Graph between the Attributes and the percentage Value using k-means clustering Algorithm

The graph in figure 11 shows that 'PGC' is one of the most important attributes of the 'Diabetes' dataset and 'DBP' is a less important attribute for prediction using the k-means clustering algorithm.

## Figure 12: Graph between the Attributes and the percentage Value using Neural Networks Algorithm

The graph in figure 12 shows that almost all the attributes of the dataset play an important role, as indicated by their high values, in prediction using the Neural Networks.

## Figure 13: Graph between the Attributes and the percentage Value using Decision tree Algorithm

The graph in figure 13 shows that 'PGC' is one of the most important attributes of the 'Diabetes' dataset and 'NTP' is a less important attribute for prediction using the Decision Tree algorithm.

| Sr. # | Attributes | K-Means | Decision Tree | Neural Networks |
|---|---|---|---|---|
| 1 | PGC | 100.00 | 100.00 | 99.13 |
| 2 | AGE | 51.57 | 36.47 | 96.59 |
| 3 | BMI | 50.24 | 52.71 | 99.53 |
| 4 | NTP | 49.15 | 4.05 | 69.90 |
| 5 | TSFT | 33.82 | 9.92 | 90.01 |
| 6 | 2HIS | 28.45 | 5.88 | 74.53 |
| 7 | DPF | 27.86 | 30.86 | 100.00 |
| 8 | DBP | 12.34 | 27.10 | 95.66 |

## Table 6: The % Importance of Diabetes Dataset Attributes in three Data Mining Algorithms

Table 6 summarizes the % values of all attributes of the 'Diabetes' dataset for the k-means clustering, Neural Networks and Decision Tree algorithms.

## Figure 14: Graph between the Variables of Diabetes Dataset and % Importance Values for all three Data Mining Algorithms

The graph shows that the % values of the attributes of the 'Diabetes' dataset are generally highest for the Neural Networks, compared with the Decision Tree and k-means clustering algorithms. For most attributes the % values are lowest for the Decision Tree algorithm, while the k-means clustering algorithm gives intermediate values. The Neural Networks indicate that all the attributes of this dataset are very important in the prediction, but when the three algorithms are compared, the attributes 'PGC', 'BMI', 'AGE' and 'DPF' stand out as very important in predicting diabetes of category 1 or 2 in patients.

The results obtained for prediction are shown in table 7.

| CLASS | R | Net-R | Avg. Abs. | Max. Abs. | RMS | Accuracy (20%) | Conf. Interval (95%) |
|---|---|---|---|---|---|---|---|
| All | 0.66 | 0.66 | 0.26 | 0.95 | 0.35 | 0.52 | 0.69 |
| Train | 0.65 | 0.65 | 0.26 | 0.95 | 0.36 | 0.52 | 0.70 |
| Test | 0.68 | 0.68 | 0.25 | 0.89 | 0.35 | 0.52 | 0.68 |

## Table 7: Performance Metrics

The quality of the prediction is judged by the R (Pearson R) value, the RMS (Root Mean Square) error and the Avg. Abs. (Average Absolute) error; the Max. Abs. (Maximum Absolute) error may also sometimes be important. The R value and RMS error indicate how "close" one data series is to another. In our case, the data series are the Target (actual) output values and the corresponding predicted output values generated by the model. R values range from -1.0 to +1.0. A larger (absolute value) R value indicates a higher correlation. The sign of the R value indicates whether the correlation is positive (when a value in one series changes, its corresponding value in the other series changes in the same direction) or negative (when a value in one series changes, its corresponding value in the other series changes in the opposite direction). An R value of 0.0 means there is no correlation between the two series. In general, larger positive R values indicate "better" models. RMS error is a measure of the error between corresponding pairs of values in two series of values; smaller RMS error values are better. Finally, another key to using performance metrics is to compare the same metric computed for different datasets. Note the R values for the Train and Test sets in the above table. The relatively small difference between the values (0.65 and 0.68) suggests that the model generalizes well and that it is likely to make accurate predictions when it processes new data (data not obtained from the Train or Test dataset).
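The metrics described above can be computed from the two series with the standard formulas, as in the following sketch. The target and predicted values are made up for illustration and are not the values behind table 7; the 'Accuracy (20%)' and 'Conf. Interval (95%)' columns are not covered.

```python
import math


def pearson_r(target, predicted):
    """Pearson correlation between the target and predicted series."""
    n = len(target)
    mean_t = sum(target) / n
    mean_p = sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(target, predicted))
    var_t = sum((t - mean_t) ** 2 for t in target)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_t * var_p)


def error_metrics(target, predicted):
    """Average absolute, maximum absolute and RMS error of the predictions."""
    errors = [abs(t - p) for t, p in zip(target, predicted)]
    return {
        "avg_abs": sum(errors) / len(errors),
        "max_abs": max(errors),
        "rms": math.sqrt(sum(e * e for e in errors) / len(errors)),
    }


# Hypothetical target classes (0 or 1) and model outputs.
target = [0.0, 1.0, 0.0, 1.0]
predicted = [0.2, 0.7, 0.1, 0.9]
r_value = pearson_r(target, predicted)
metrics = error_metrics(target, predicted)
```

Computing the same metrics separately on the Train and Test records gives the comparison discussed above.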

A graph is drawn between the target output and the predicted output as shown in figure 15.

## Figure 15: A Graph between the Target Output and the Predicted Output using Neural Networks

The graph in figure 15 shows that the predicted outputs and the target outputs are close to each other. Two conclusions are drawn from this graph: the data in the dataset is properly cleansed, so the prediction may be more accurate, and the 'class' attribute 'category' of the diabetes dataset depends on all the remaining attributes of this dataset.

## 5. Conclusion

In this research paper we present the prediction, classification and interpretation of a dataset, 'Diabetes', using three data mining algorithms, namely k-means clustering, decision tree and neural networks. For the visualization of these results, 2D scattered graphs are drawn. We first create a vertical partition of the given dataset, based on similar values of the attributes. To discover interesting patterns in the given dataset, we combine the three data mining algorithms in a cascaded way, i.e. the output of one algorithm is used as the input of the next. The decision rules obtained from the decision tree algorithm can further be used as simple queries for any medical database. One interesting finding from this case is the pattern identified in the given dataset: "Diabetes of category 1 or 2 depends upon 'Plasma Glucose Concentration', 'Body Mass Index', 'Diabetes Pedigree Function' and 'Age' attributes". We conclude that the attributes 'PGC', 'BMI', 'DPF' and 'AGE' of the 'Diabetes' dataset play an important role in predicting whether a patient is diabetic of category 1 or category 2. However, the results and model proposed in this paper require further validation and testing by medical experts.