Two major problems in data mining are (1) dealing with missing values in the datasets used for knowledge discovery and (2) using one dataset as a predictor of other datasets. We explore these problems using four heart disease datasets from the UCI Machine Learning Repository, collected from four different sources and containing different amounts of missing values. Each dataset contains 13 attributes and one class attribute, which denotes the presence or absence of heart disease. Missing values were handled in three ways: first, by the usual mean and mode method; second, by removing the attributes that contain missing values; and third, by removing the records in which more than 50 percent of the values are missing and filling the remaining missing values. We also experimented with different classification techniques on these medical datasets, including Decision Tree, Naive Bayes, and Multilayer Perceptron, using the RapidMiner and Weka tools. The consistency of the datasets was assessed by combining them and comparing the results on the combined dataset with the classification error on the individual datasets. The results show that when few values are missing, the mean and mode method works well. When a larger share of values is missing, the third method of removing records, along with different preprocessing steps, works better. Using one dataset as a predictor of another may not classify the data correctly in all cases.
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. The steps involved are data integration, data selection, data cleansing, data transformation, and the data mining itself. In data selection, only the data that is useful for mining is selected. In data cleansing, errors, missing values, and inconsistencies in the collected data are corrected. Even after cleansing, the data is not ready for mining, so the next step is data transformation: aggregation, smoothing, normalization, discretization, and so on. The final step is the data mining itself, that is, finding interesting patterns within the data. Data mining techniques include classification, regression, clustering, and association.
A central task in data mining is classification. Given a collection of records containing a set of attributes, one of which is the class attribute, the goal is to find a model for the class attribute as a function of the values of the other attributes. A training set is used to build the model, and a test set is used to evaluate how well the model classifies data. To predict the performance of a classifier on new data, we need to assess its error rate on a dataset that played no part in the formation of the classifier. This independent dataset is called a test set.
The main focus of this paper is to find patterns related to coronary artery disease in the UCI heart datasets. According to a 2007 report, nearly 16 million Americans have coronary artery disease (CAD). In the U.S., coronary artery disease is the leading killer of both men and women. Each year, nearly 500,000 people die because of CAD.
A medical dataset usually records the results of a number of tests conducted to diagnose a disease. However, most medical datasets have large numbers of missing values: tests may not be conducted because of their expense, attributes may not have been recorded when the data was collected, or attributes may be withheld by users because of privacy concerns. Further complicating the problem from a data mining standpoint, different groups of physicians collect different data; that is, different medical datasets often contain different attributes, making it difficult to use one dataset as a predictor of another.
This paper focuses on two issues. The first is preprocessing the data, mainly dealing with missing attribute values, measuring how much the accuracy improves after the preprocessing steps, and comparing the accuracy of different classification algorithms such as Decision Tree, Naive Bayes, and Neural Networks.
The second focus is to train a model using classification techniques and to test it on different datasets, in order to verify the results obtained on the training data. The Cleveland database, collected from the Cleveland Clinic Foundation, was used as the training set. The Switzerland, Hungary, and VA Long Beach datasets were used as test sets. All the datasets contain 13 attributes and one class attribute. The Cleveland dataset used for training contains only 6 missing values, while the three test datasets contain almost 90% missing values.
1.4 DATA SET
The heart data sets are collected from the UCI repository. Each data set consists of 13 attributes and one special (class) attribute; 6 of the attributes are numerical and 8 are categorical.
The following are the four data sets for heart disease:
1. Cleveland
2. Hungary
3. Switzerland
4. VA Long Beach
The source and the creator of the datasets:
1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D
These datasets contain 76 attributes, but only 14 of them are used in most published research. The presence of heart disease is indicated by a value of 1, 2, 3, or 4, and its absence by the value 0.
The 14 attributes that are used are:
1. Age in years
2. Sex - (1=Male; 0=Female)
3. Cp - Chest pain type
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic
4. Resting blood pressure (in mm Hg on admission to the hospital)
5. Serum cholesterol in mg/dl
6. (Fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. Resting electrocardiographic results
Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. Maximum heart rate achieved
9. Exercise induced angina (1 = yes; 0 = no)
10. ST depression induced by exercise relative to rest
11. The slope of the peak exercise ST segment
Value 1: upsloping
Value 2: flat
Value 3: downsloping
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
14. The predicted attribute
Diagnosis of heart disease (angiographic disease status)
Value 0: < 50% diameter narrowing
Value 1, 2, 3, 4: > 50% diameter narrowing
The attributes with the largest amounts of missing values are the slope of the peak exercise ST segment (slope), the number of major vessels (0-3) colored by fluoroscopy (ca), the normal/fixed defect/reversible defect attribute (thal), and serum cholesterol in mg/dl (chol).
2 LITERATURE REVIEW
2.1 RapidMiner
RapidMiner, formerly known as YALE (Yet Another Learning Environment), is software widely used for machine learning, knowledge discovery, and data mining. It is used both in research and in practical data mining.
RapidMiner is written in the Java programming language, which means it can run on any operating system. It can handle many input formats, such as CSV, ARFF, SPSS, XRFF, and database example sources, with attributes described in an XML file format. The types of operators available include input, output, data preprocessing, and visualization.
RapidMiner contains more than 500 operators. Nested operator chains are described in XML files, which can be created through RapidMiner's graphical user interface. Individual RapidMiner functions can also be called directly from the command line. The tool makes it easy to define analytical steps and to generate graphs. It provides a large collection of data mining algorithms for performing classification, as well as many visualization tools such as overlapping histograms, 3D scatter plots, and tree charts.
RapidMiner can handle many kinds of tasks, including classification, clustering, validation, visualization, preprocessing, and post-processing. It also supports many preprocessing steps, such as discretization, outlier detection and removal, filtering, selection, weighting, and normalization. All modeling and attribute evaluation methods from Weka are available within RapidMiner. RapidMiner has two views, the Design view and the Result view. The Design view is used to build and run the process, and the Result view is used to display the results. Image 1 below shows the RapidMiner interface.
Image 1. RapidMiner Interface
Image 2. RapidMiner Design View
2.2 Decision Trees
Decision trees are a supervised learning technique commonly used for tasks such as classification and regression. They are mainly used in the fields of finance, engineering, marketing, and medicine. Decision trees can handle nominal, numeric, and text data. They mainly focus on the relationships among the attributes. The input to a decision tree is a set of objects described by a set of properties, and the output is a yes/no decision or one of several different classifications.
Image 3. Decision tree created by RapidMiner
Since decision trees can be represented graphically as tree-like structures, they are easy for humans to understand. The root node is the beginning of the tree, and each internal node tests an attribute. At each node, the value of that attribute for the given instance determines which branch to follow to a child node. An instance is classified by starting at the root node and following branches until a leaf node is reached. Building a decision tree involves repeatedly dividing the training data, from the root node down to the leaf nodes, until the entire data set has been analyzed. The data is split until all instances in a partition have the same classification or no further split is possible because no attributes remain.
An efficient decision tree is one in which the root node divides the data effectively, and therefore requires fewer nodes. One of the important things is to select an attribute that best splits the data into individual classes. The splitting is done based on the information gain of each attribute. Information gain is based on the concept of entropy, which gives the information required for a decision in bits. Entropy is calculated from
$$\mathrm{Entropy}(P_1, P_2, \ldots, P_n) = -\sum_{i=1}^{n} P_i \log_2 P_i$$
The information gain of each attribute is computed from the entropy. First the entropy of the whole data set is calculated. Then, for each attribute, the data is partitioned by the attribute's values and the weighted entropy of the partitions is computed; the difference from the entropy of the whole set is the information gain of that attribute. The attribute with the highest information gain splits the data best, and the split is made on it. The same procedure is repeated at each child node until the leaf nodes are reached.
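The entropy and information gain calculation can be illustrated with a minimal Python sketch. The function names and the toy records are assumptions made for illustration; they are not taken from the heart datasets or from RapidMiner.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy records, invented for this example.
records = [
    {"exang": 1, "sex": 1, "disease": 1},
    {"exang": 1, "sex": 0, "disease": 1},
    {"exang": 0, "sex": 1, "disease": 0},
    {"exang": 0, "sex": 0, "disease": 0},
]
gains = {a: information_gain(records, a, "disease") for a in ("exang", "sex")}
best = max(gains, key=gains.get)  # attribute chosen for the split
print(best, gains)
```

In this toy data, exang separates the two classes perfectly while sex carries no information, so the split would be made on exang.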
A decision tree is built using a training dataset, and the tree can then be used to classify examples in a test dataset. Decision trees can also be used to describe data explicitly and to support decision making, since they produce rules that are easy to understand and can be read by any user.
Sometimes decision tree learning produces a tree that is too large. If the tree is too large, it generalizes poorly to new samples. Pruning is an important step in decision tree learning that addresses this problem. The size of the tree is reduced by pruning, that is, by removing branches built on irrelevant attributes, as long as doing so does not reduce the accuracy of the tree. Pruning improves the accuracy of the tree on future instances. It also reduces the problems of overfitting and noisy data, since the irrelevant branches they create are discarded.
2.3 Naive Bayes
Another classifier is Naive Bayes. Naive Bayes operates in two phases, training and testing. It is computationally cheap and can handle around 10,000 attributes. It is also a fast and highly scalable model.
Naive Bayes treats the attributes as independent of each other in their contribution to the class attribute. For example, a fruit may be considered an apple if it is round, red, and 4'' in diameter. Although these features may depend on each other, Naive Bayes considers each of them independently when deciding that the fruit is an apple.
One advantage of Naive Bayes is that it does not require a large number of instances for every possible combination of attribute values. Naive Bayes can be used for both binary and multiclass classification problems. In its basic form it handles discrete or discretized attributes, so numerical attributes require binning. Discretization methods are either supervised or unsupervised. Supervised discretization uses the class information of the training set to select the discretization cut points. Unsupervised discretization does not use the class information.
The entire training data set is used for classification and discretization. Unsupervised discretization methods include equal width, equal frequency, and fixed frequency discretization. Error-based and entropy-based discretization are supervised methods. Entropy-based discretization uses the class information: the entropy is calculated from the class labels, and the best split is chosen so that the bins are as pure as possible, that is, so that the majority of values in a bin share the same class label. The split is chosen to maximize the information gain.
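The two simplest unsupervised schemes, equal width and equal frequency discretization, can be sketched in plain Python as follows. The function names and the sample ages are assumptions for illustration only; this is not the implementation used by RapidMiner or Weka.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign values to k bins holding roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


ages = [29, 35, 41, 44, 52, 58, 63, 77]  # hypothetical ages
width_bins = equal_width_bins(ages, 4)
freq_bins = equal_frequency_bins(ages, 4)
print(width_bins)
print(freq_bins)
```

Equal width bins can end up unevenly populated when the values are skewed, whereas equal frequency bins always hold about the same number of values but may have very different widths.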
2.4 Neural Networks
The human brain serves as a model for Neural Networks. Artificial neurons were first proposed in 1943 by Warren McCulloch, a neurophysiologist, and Walter Pitts, an MIT logician. Neural Networks are useful for data mining and decision support applications. They are also useful for pattern recognition and data classification through a learning process.
A neural network is built from neurons and weights. The strength of the network depends on the interaction between these building blocks. The Multilayer Perceptron (MLP) model is the most commonly used; it consists of three layers: input, hidden, and output. The values of the input layer come from the values in a data set. The input neurons send data via synapses to the hidden layer, which in turn sends data via synapses to the output layer.
The MLP uses a supervised technique called back propagation for training. Each layer is fully connected to the succeeding layer. Each neuron receives signals from the previous layer, and each signal is multiplied by a separate weight value. The weighted inputs are summed and passed through a limiting function, which scales the output to a fixed range of values. The output is then sent to all the neurons in the next layer. The error at each output is then "back propagated" to the hidden and input layers, and the weights are changed based on the derivative of the error with respect to the weights.
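The forward pass of such a network (weighted sum followed by a limiting function, layer by layer) can be sketched as below. The sigmoid is assumed as the limiting function, and the bias terms, weights, and network size are hypothetical; back propagation itself is not shown.

```python
import math


def sigmoid(x):
    """Limiting function that squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias, and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = layer_forward(inputs, hidden_w, hidden_b)
    return layer_forward(hidden, output_w, output_b)


# Hypothetical weights for a network with 2 inputs, 2 hidden neurons, 1 output.
out = mlp_forward(
    [0.5, -1.0],
    hidden_w=[[0.1, 0.4], [-0.3, 0.2]],
    hidden_b=[0.0, 0.0],
    output_w=[[0.7, -0.5]],
    output_b=[0.0],
)
print(out)
```

Training would repeatedly compare such outputs with the known class labels and adjust the weights in the direction that reduces the error.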
MLP training involves adjusting parameters such as the number of neurons in the hidden layer. If too few neurons are used, complex data cannot be modeled and the results will be poor. If too many neurons are used, training may take a long time and the network may overfit the data: it may perform well on the training set but give poor results on the test set.
Cleveland, Switzerland, Hungary, and VA Long Beach are the four datasets collected from the UCI Machine Learning Repository for this project. The Cleveland dataset is used for training, and the Switzerland, Hungary, and VA Long Beach datasets are used for testing. One important problem is using a model trained on the Cleveland dataset to test the other three datasets. The other is dealing with the missing values. All four datasets contain 13 attributes and one class attribute.
First, all four datasets are collected in .txt format. The datasets are loaded into RapidMiner using the IO Example Source operator.
Number of Instances
VA Long Beach
3.1. Preprocessing the Data
The first step is preprocessing the data.
The preprocessing steps used are:
Filling the missing values
Dealing with outliers
Attribute selection, numeric data discretization, normalization, etc.
First the Cleveland data is preprocessed and models are trained on it using different algorithms. Then the preprocessing steps are applied to the test data sets as well.
This process is carried out using the machine learning tools RapidMiner and Weka. The data source is the heart data sets taken from the University of California Irvine (UCI) Machine Learning Repository. The following section gives details about the datasets.
The first step is to preprocess the Cleveland data so that a model can be trained on it. This primarily involves dealing with missing values, outliers, and feature selection.
Missing values in a dataset can represent many different things. They may be due to a test that was not conducted or to data that was not available. Missing values in RapidMiner and Weka are usually represented by "?". Missing values can reduce the quality of the classification, so filling them plays an important role in data mining. Different methods are used for dealing with missing values. The first method is the most frequently used one: replacing missing categorical values with the mode and missing numerical values with the mean. The second method is removing the attributes that contain around 90% missing data. The attributes removed are ca, thal, slope, and chol; with this method no change was observed in the classification error. The third method is to remove an instance from the data set if 6 or more of its 13 attribute values are missing, since an instance missing too many values would harm the consistency of the data. The remaining missing values are then filled based on the frequency of the class attribute.
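The first method (mean and mode replacement) and the record-removal part of the third method can be sketched in Python as follows. This is a minimal sketch: the function names, the row layout (one dictionary per instance), and the toy rows are assumptions, and the class-frequency fill of the third method is not shown.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # marker for a missing value, as used by RapidMiner and Weka


def drop_sparse_rows(rows, max_missing=5):
    """Keep rows with at most `max_missing` missing values (drops rows with 6 or more)."""
    return [r for r in rows if sum(v == MISSING for v in r.values()) <= max_missing]


def impute_mean_mode(rows, numeric, categorical):
    """Fill numeric columns with the column mean and categorical columns with the mode."""
    fill = {}
    for col in numeric:
        fill[col] = mean(float(r[col]) for r in rows if r[col] != MISSING)
    for col in categorical:
        observed = [r[col] for r in rows if r[col] != MISSING]
        fill[col] = Counter(observed).most_common(1)[0][0]
    return [
        {c: (fill[c] if v == MISSING and c in fill else v) for c, v in r.items()}
        for r in rows
    ]


# Hypothetical rows with three of the attributes, invented for this example.
rows = [
    {"age": 60, "chol": 200, "cp": 4},
    {"age": 50, "chol": MISSING, "cp": 4},
    {"age": 40, "chol": 240, "cp": MISSING},
]
filled = impute_mean_mode(rows, numeric=["age", "chol"], categorical=["cp"])
print(filled)
```

Note that the sketch assumes every column has at least one observed value; a column that is entirely missing would have to be removed first, as in the second method.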
Outliers are observations that deviate markedly from the rest of the dataset, that is, instances that lie at an abnormal distance from the other instances in the data. Sometimes they are caused by common errors, such as errors in data transmission. In other cases outliers carry significant information about how the data was acquired. Common methods used to identify outliers are density based outlier detection, distance based outlier detection, and LOF outlier detection.
Distance based outlier detection uses the k-nearest neighbor algorithm to identify outliers. Density based outlier detection uses a distance function such as squared distance, Euclidean distance, angle, cosine distance, or inverted cosine distance, and LOF outlier detection identifies outliers using minimal upper and lower bounds together with a density function.
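Distance based detection with k nearest neighbors can be sketched as follows: each instance is scored by the Euclidean distance to its k-th nearest neighbor, and instances with the largest scores are outlier candidates. The function name, the choice of k, and the sample points are assumptions; the RapidMiner operator may differ in detail.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical (age, resting blood pressure) pairs; the last lies far from the rest.
points = [(50, 120), (52, 125), (48, 118), (51, 122), (90, 200)]
scores = knn_outlier_scores(points, k=2)
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(most_outlying, scores)
```

In practice the attributes would be normalized first, so that an attribute with a large numeric range does not dominate the distance.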
Feature selection is mainly used to find the features that play an important role in classification and to remove features with little or no predictive information. It is mainly used on data sets with many features. The two types of feature selection methods used are the filter method and the wrapper method. The filter method selects features independently of the classifier, whereas the wrapper method makes use of the classifier for feature selection. The filter method selects features based on general characteristics of the data, so it is much faster than the wrapper method. The wrapper method uses the induction algorithm as an evaluation function to select a feature subset.
The four data sets are collected in .txt format. First the Cleveland dataset is read into RapidMiner using the attribute editor, and the .dat and .aml files are created for it. The mean and mode of each attribute are obtained, and the missing values are identified. The Cleveland dataset contains only 6 missing values. Removing the instances with missing values would not affect the dataset, since less than 2% of the data is missing, so they are removed.
The next step is to deal with the outliers. Distance based outlier methods, using k nearest neighbors and Euclidean distance, are used to detect them. Because an outlier can be significant in a medical data set, each outlier's role is examined, and it is removed or retained accordingly.
The next step is to select the features that play an important role in the data. Information gain weighting is used as the filter method, and forward and backward selection are used as wrapper methods. The information gain weight is used to find the features that play an important role in the classification of the data. First the information gain of the attributes is obtained. The attributes are then selected by choosing the top 4 attributes, the top 5 attributes, and so on, and also by using 50% and 70% of the attributes. These experiments showed that the top 10 features play an important role in the classification, so the top 10 attributes are selected to build the model. The wrapper method selects features depending on the learning algorithm, and the features selected with one algorithm may differ from those selected with another. Feature selection mostly improves the accuracy. Forward selection starts with a minimal attribute subset, and attributes are added until there is no further performance gain.
Backward selection is the opposite of forward selection: it starts with the complete attribute set, and attributes are removed as long as the performance does not degrade. Decision Tree and Multilayer Perceptron are the algorithms used to select features with forward and backward selection. The preprocessed Cleveland data set is then saved as a new file.
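Forward selection as a wrapper method can be sketched as a greedy loop around a scoring function. In the sketch below the search starts from the empty set, which is an assumption, and `toy_score` is a hypothetical stand-in for the validated accuracy of a learner such as a decision tree; the attribute weights in it are invented.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves `score` until nothing improves it."""
    selected, best = [], score([])
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no performance gain, so stop adding attributes
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


# Hypothetical contributions of attributes to the score, invented for this example.
USEFUL = {"cp": 0.10, "thal": 0.08, "ca": 0.05}


def toy_score(subset):
    """Useful attributes raise the score; every attribute carries a small cost."""
    return 0.5 + sum(USEFUL.get(a, 0.0) for a in subset) - 0.01 * len(subset)


selected, best = forward_selection(["age", "cp", "fbs", "thal", "ca"], toy_score)
print(selected, round(best, 2))
```

Backward selection follows the same pattern in reverse: start from the full attribute list and repeatedly drop the attribute whose removal hurts the score least.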
Once the Cleveland data set used for training has been preprocessed, it is ready to be tested against the three test sets. Before testing, the three test datasets are preprocessed as well. Three classification algorithms, Decision Tree, Naive Bayes, and Multilayer Perceptron, are used to build classification models from the Cleveland dataset. Another important step in data mining is normalization, which transforms the data to a uniform scale with common minimum and maximum values. An important step for the MLP is choosing the number of hidden layers, which RapidMiner allows the user to set. The number of hidden layers is tried as 0, 1, 2, and so on, and a minimum of one hidden layer is used. The model is then built using the classification algorithms. The next important step is preprocessing the test datasets, in which dealing with the missing values is essential. Once the model is built, the Switzerland, Hungarian, and VA Long Beach datasets are used to test it.
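Min-max normalization, which rescales an attribute to a common minimum and maximum, can be written in a few lines. The function name, the default target range of 0 to 1, and the sample values are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no scale information; map it to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


heart_rates = [100, 150, 200]  # hypothetical maximum heart rate values
normalized = min_max_normalize(heart_rates)
print(normalized)
```

When a model is trained on one dataset and tested on another, the minimum and maximum learned from the training data would normally be reused on the test data so that both share the same scale.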
The Switzerland, Hungarian, and VA Long Beach datasets consist of 50% to 90% missing values. Because of this, it is not possible to simply remove the instances with missing values.
The following methods are used to fill the missing data.
As the cholesterol attribute has about 99% missing values in one dataset, its missing values were replaced by a normal value based on age and gender. For females, the cholesterol level used is 183 mg/dL below age 40, 119 mg/dL from age 40 to 49, and 219 mg/dL at age 50 or above. For males, the level used is 185 mg/dL below age 40, 205 mg/dL from age 40 to 49, and 208 mg/dL above age 50.
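This replacement rule amounts to a small lookup by age band and sex, sketched below. The function name is hypothetical, the values are the ones quoted in the text, and treating a male aged exactly 50 as belonging to the oldest band is an assumption.

```python
def default_cholesterol(age, sex):
    """Replacement cholesterol value (mg/dL) for a missing entry.

    `sex` follows the dataset coding: 1 = male, 0 = female.
    The values are those quoted in the text; age 50 is assumed to fall
    in the oldest band for both sexes.
    """
    if sex == 0:  # female
        if age < 40:
            return 183
        if age < 50:
            return 119
        return 219
    # male
    if age < 40:
        return 185
    if age < 50:
        return 205
    return 208


print(default_cholesterol(45, 1))
```

A missing chol value would then be filled with `default_cholesterol(row_age, row_sex)` for the instance in question.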
The first method used to fill the missing values is the most frequently used one: replacing missing categorical values with the mode and missing numerical values with the mean. When the missing values in all the data sets were replaced with this method, the Hungarian dataset gave a lower classification error, while the other two datasets still produced a high classification error.
Different methods are used to deal with the missing values in the two data sets that produced the highest classification error. One method is removing the attributes that contain around 90% missing data. The attributes removed are ca, thal, slope, and chol. This method did not improve the number of correct predictions, so a third method is used to deal with the missing values.
The third method is to remove an instance from a data set if 6 or more of its 13 attribute values are missing, because instances missing too many values would harm the consistency of the data. Since the ca attribute contains about 99% missing values, it is removed from the datasets. Since ca is a redundant attribute, removing it does not affect the data set. The remaining missing values are then filled based on the frequency of the class attribute. With this method, however, the results for the Switzerland data set did not change. There was an increase in the number of correct predictions for the Hungary data set.
3.2. Building the Model
In order to compare the effectiveness of different classification algorithms, Decision Tree, Naive Bayes, and Multilayer Perceptron are used. First the Cleveland data set is used to build a model with the Decision Tree classification algorithm in RapidMiner. Different criteria are used to build the decision tree: the Gini index, gain ratio, and information gain. Information gain produced the best decision tree, so it is the criterion used for attribute selection and for numerical splits when building the tree. Simple accuracy alone is not the best measure for judging a classifier, so sensitivity and specificity are also considered.
The accuracy on the positive instances is Sensitivity:
Sensitivity = True Positive/ (True Positive + False Negative)
The accuracy on the negative instances is Specificity:
Specificity = True Negative/ (True Negative + False Positive)
Accuracy = (True Positive + True Negative) / (True Positive + False Negative + True Negative + False Positive)
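These three measures follow directly from the counts in the confusion matrix, as the short sketch below shows. The function name and the counts are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # accuracy on the positive instances
    specificity = tn / (tn + fp)  # accuracy on the negative instances
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Hypothetical counts, invented for this example.
sens, spec, acc = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(sens, spec, acc)
```

With these counts the classifier finds 80% of the diseased cases and 90% of the healthy ones, which a single accuracy figure of 85% would not reveal.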
The Multilayer Perceptron method is used with more than one hidden layer to find the accuracy. In this way the model built on the training set is tested on all three test datasets. The datasets are tested after replacing the missing values by the three different methods.
Then, to assess the consistency of the algorithms, all four datasets are combined, since they all contain the same attributes and one class attribute, and the combined data is preprocessed. Once the preprocessing is done, the combined dataset is divided into two parts, a training set and a test set: 70 percent of the data is used for training and the remaining 30 percent for testing. The Decision Tree, Naive Bayes, and Multilayer Perceptron are then used to build models, which are tested on the test set.
Different experiments are conducted to test how well a data set collected from one source acts as a predictor for data sets collected from other sources, and to compare the accuracy of the different algorithms. Since the data sets contain missing values, three different methods are used to fill them, and the accuracy is measured to identify the method that works best. Different preprocessing steps are also applied, and the features that play an important role are identified.
To check the consistency of the data, all four data sets are combined, and 70% of the data is used for training and the remaining 30% for testing.
The first step is to identify the percentage of missing values in each data set. Table 1 gives the percentage of missing values in each attribute of the data sets. It can be observed that the Hungary data set contains fewer missing values than the Switzerland and VA Long Beach data sets.
Table 1. Percentage of missing values in each dataset
VA Long Beach
The main goal is to use one data set as a predictor of another, so a model must be built using the Cleveland data set. The first step is therefore to preprocess the training data set and build the model using Decision Tree, Naive Bayes, and Multilayer Perceptron (MLP). Table 2 gives the accuracy, that is, the proportion of correct predictions, obtained with the three algorithms after building the model. It can be observed from the table that the Multilayer Perceptron performed best for building the model, with an accuracy of 91.75%, followed by the Decision Tree with an accuracy of 81.19%.
Table 2. Accuracy of the Cleveland dataset using classification algorithms
Once the missing values have been filled using the mean and mode method described above, the model built from the Cleveland data set is tested on the Hungary, Switzerland, and VA Long Beach datasets to see how it performs on them. Table 3 below shows that the highest accuracy, that is, the highest number of correct predictions, is obtained on the Hungary data set. As seen from Table 1, the Hungary data set has fewer missing values than the Switzerland and VA Long Beach data sets, so the mean and mode method works better for the Hungary data set.
Table 3. Accuracy obtained after the model was tested
Training and Test datasets
Cleveland and Switzerland
Cleveland and VA Long Beach
Cleveland and Hungary
The next important step is selecting the features that play an important role. For feature selection, information gain weighting is used, and the top 10 attributes are selected to build the model. Table 4 gives the information gain weighting of the attributes.
Table 4. Information gain of attributes
After the attributes have been ranked by information gain weighting, the top 10 attributes are used to build the model, and the model is tested on the three data sets to see whether the number of correct predictions improves. Table 5 below shows the increase or decrease in the number of correct predictions.
Table 5. Information gain weighting of attributes
VA Long Beach
Features are also selected using the wrapper method, with forward and backward selection. The Decision Tree and MLP are used as the learners for forward and backward selection. The following table gives the accuracy obtained with the forward and backward selection methods, using the Decision Tree, Naive Bayes, and MLP algorithms.
Table 5. Accuracy of Hungary Data set using Wrapper method
Backward Selection DT
Backward Selection using MLP
Forward Selection MLP
From Table 3 and Table 5 it can be seen that the model built from the Cleveland data set performed better on the Hungary data set than on the Switzerland and VA Long Beach data sets. The number of correct predictions for the Switzerland and VA Long Beach data sets is very low: fewer than 30% of the instances are correctly predicted.
So the third method is used to fill the missing values in the Switzerland and VA Long Beach data sets, the same preprocessing steps are applied to the test data sets, and the accuracy is measured again. Table 6 below shows that the accuracy is improved compared with the first method of filling the missing values; the number of correct predictions increases.
Table 6. Accuracy of the VA Long Beach after replacing the missing values by Third method
The next step is to select the features that play an important role in the classification of the data set. The results show that the number of correct predictions improves in some cases and remains the same in others.
Table 7. Feature Selection accuracy of VA Long Beach
Table 8 gives the accuracy obtained using the full combined data set. 70% of the data is used for training and 30% for testing. The model is built using Decision Tree, Naive Bayes, and MLP. Table 8 below shows that the proportion of correct predictions is around 60%. This table is mainly used to assess the consistency of the data.
Table 8. Correct number of predictions using Full data set