Prediction of Coronary Heart Disease using Supervised Machine Learning Algorithms
Paper Type: Free Essay | Subject: Computer Science | Wordcount: 5293 words | Published: 8th Feb 2020
Keywords— Coronary Heart Disease, Supervised Learning, Naïve Bayes Algorithm (NB), Support vector machine (SVM), Decision Tree (DT) J48, Machine Learning.
I. INTRODUCTION
In this age of technology and digitalization, data has proven to be the fuel of organizations and industries. The healthcare industry is not far behind in this respect. Nowadays, almost all hospitals and medical institutes store their patients’ data electronically. This includes medical history, symptoms displayed, diagnosis, duration of illness, recurrences and any fatalities. As a result, the volume of medical data generated daily is constantly increasing. However, this wealth of data is often left untapped owing to a lack of effective analytical tools, methods and personnel to discover insights and hidden relationships in it. If the data at hand is used to develop screening and diagnostic models, it will not only reduce the strain on medical personnel but also aid early detection and prompt treatment for patients, thereby substantially improving the health system.
In recent years, researchers and experts in the medical field have begun to realize the immense knowledge available in these medical datasets, inspiring analysis of medical data for conditions such as dementia, Alzheimer’s disease, tuberculosis screening, diabetes and cancer. Amidst this vast array, one of the predominant and most important diagnoses in the field of health analysis is Coronary Heart Disease (CHD). Coronary arteries play a vital role in delivering oxygen to the heart muscle. According to the Southern Cross Medical Care Society of New Zealand, the constant build-up of fat or bad cholesterol within the artery walls leads to their narrowing and eventual blockage, giving rise to CHD (Southern Cross, 2018). A mild level of blockage might lead only to initial discomfort and changes in the person’s lifestyle. However, when the flow of oxygen through the coronary arteries is severely hampered, it can prove fatal. The risk factors associated with CHD are a combination of controllable factors, such as those influenced by one’s lifestyle, and uncontrollable factors such as age, ethnicity and family medical history. Early detection of CHD symptoms can help the patient control some of these risk factors through lifestyle changes and/or medication, preventing the disease from progressing to a severe and potentially fatal form.
In this era of Data Science, Machine Learning tools and algorithms are used across many fields to gain meaningful insights and to base decisions on the information mined. They have not only helped accelerate monetary gains and business success but have also played a vital role in automating and simplifying various processes. Machine learning (ML) is a science wherein computers are trained to learn from data without being explicitly programmed. SAS (SAS Institute, 2018) defines Machine Learning as an automated analytical model-building method used by systems to learn from data, identify patterns and reduce human intervention in the decision-making process. Machine Learning algorithms can be broadly classified into two main types: Supervised learning and Unsupervised learning.
Fig. 1: Main types of Machine learning algorithms.
Supervised Learning: It involves training on a labelled dataset, in which each record has independent (input) variables and a known dependent (output) variable. The algorithm receives the inputs along with the known output, compares its predicted output with the known one to find errors, and modifies the model accordingly. This type of learning works well where historical data is used to predict likely future events.
Unsupervised Learning: It involves searching for patterns within a dataset that has no historical labels, without designating any variable as the output. This type of learning suits transactional data and is used in marketing and promotional planning decisions.
II. PROBLEM STATEMENT
Medical diagnosis is an intricate and complicated task that must be carried out with great precision while taking various factors into consideration. Prediction of Coronary Heart Disease is an even more complex challenge, considering the level of expertise, experience and knowledge required for accurate prediction. According to a survey by WHO, medical professionals can correctly predict heart disease with only 67% accuracy. Considering that heart disease casualties are expected to rise over the years, there is immense research scope for predicting coronary heart disease (CHD).
Machine learning techniques allow the same algorithm to be used across different datasets (Sharma & Rizvi, August 2017). This reusability makes machine learning a strong contender among the techniques used to build models for diagnosing CHD. Moreover, the historical medical data already available in medical institutes can be used to train the model so that the final model has high predictive accuracy. Based on the supervised learning model developed, medical personnel and experts will be able to predict whether a patient who shows the underlying traits of CHD really does suffer from it.
In this essay, we look at the existing research on predicting CHD and try to answer the following research question: Which is the most effective algorithm for predicting CHD, for a given dataset? For convenience, we restrict the scope of this essay to supervised learning techniques, specifically the following three algorithms:
- Decision Tree J48
- Naïve Bayes Algorithm
- Support Vector Machine (SVM)
III. LITERATURE REVIEW
Various research initiatives have been undertaken by experts, academic scholars and the data science community in predicting and screening medical data for various diseases. One of the most challenging predictions in this respect is that of Coronary Heart Disease. Various machine learning algorithms have been used in past research to carry out these predictions. We review a few of these research papers before proceeding with our own study on a dataset.
To address the medical community’s need for a CHD prediction technique, (Yanwei X, 2007) built a data mining model to predict CHD using 100 CHD records together with survival rate information. He applied SVM, Artificial Neural Network (ANN) and Decision Trees (DT) to 502 instances, using 10-fold cross-validation and a confusion matrix to measure model performance. The accuracies obtained were 92.1%, 91.0% and 89.6% for SVM, ANN and DT respectively, making SVM the best classifier model in his study.
(Heon Gyu Lee, 2007) used heart rate variability (HRV) indices to detect Coronary Artery Disease (CAD). He derived these HRV indices from multiparametric features, both linear and non-linear. He predicted Coronary Heart Disease (CHD) by developing models using Classification based on Multiple Association Rules (CMAR), Bayesian classification, C4.5 Decision Tree, an associative classifier and SVM, and carried out feature selection using statistical analysis tools. The study showed the highest accuracy of 90% for SVM, followed by 80% for CMAR, 78% for C4.5, and 81% and 85% for the two Naïve Bayes variants, Tree Augmented NB (TAN) and Selection Tree Augmented NB (STAN), respectively.
(I.S.Jenzi, 2013) established relations between key patterns in a dataset of 14 attributes by using association rules. They built a reliable classifier model using classification techniques such as decision tree, Naïve Bayes and Neural Network. The Microsoft .NET platform was used to build the graphical user interface (GUI), with the IKVM interface and Java libraries providing the interconnections. The accuracy of the model was depicted via receiver operating characteristic (ROC) curves. The results showed that the area under the ROC curve for the data mining model was 0.807, which was better than that of the Naïve Bayes algorithm.
(Peter, 2012) carried out a study in which he cleaned the data using classification data mining techniques and thereby detected the complex relationships and interdependence among the variables. Thereafter he developed models using Naïve Bayes, decision tree, k-NN and neural network. Naïve Bayes performed better than the other methods, with an accuracy of about 83.70%. The accuracy of the other classifiers was 76.66%, 75.18% and 78.148% for DT, k-NN and NN respectively.
(Apte, 2012) predicted heart disease using a dataset of 13 attributes such as sex, blood pressure and cholesterol, to which she added two more attributes: smoking and obesity. NN, DT and NB classification techniques were used, and the accuracies obtained on the 15-attribute dataset were 100%, 99.62% and 90.74% respectively. She used a confusion matrix to evaluate model performance.
IV. DATA ANALYSIS
A. Data Description.
The dataset for this research essay is the South African Heart Disease dataset, which is a subset of a larger data set. It has 462 observations (instances) and 10 attributes in all, of which 9 are independent factors and 1, CHD, is the dependent variable or labelled class. The dataset is a retrospective sample of males in a heart-disease high-risk region of the Western Cape in South Africa (KEEL (Knowledge Extraction based on Evolutionary Learning), 2004-2018), where the labelled class CHD has two outcomes: positive (1) and negative (0). Each high-risk patient was monitored in this study and the attributes obtained were as follows: systolic blood pressure (sbp), cumulative tobacco in kg (tobacco), bad cholesterol, also known as low density lipoprotein cholesterol (ldl), adiposity, family history of heart disease (famhist), type-A behaviour (typea), obesity, current alcohol consumption (alcohol), and age at onset (age).
Attribute | Domain | Data Type | Missing values?
sbp | [101,218] | Integer | No
tobacco | [0.0,31.2] | Number | No
ldl | [0.98,15.33] | Number | No
adiposity | [6.74,42.49] | Number | No
famhist | {Present, Absent} | Factor | No
typea | [13,78] | Integer | No
obesity | [14.7,46.58] | Number | No
alcohol | [0.0,147.19] | Number | No
age | [15,64] | Integer | No
CHD | [0,1] | Integer | No
Table 1. Attribute Description.
For clarity, we define a few of the terms below (Sivakumar, n.d.):
- Sbp: Systolic blood pressure, i.e. the blood pressure when the heart is contracting.
- Adiposity: It is measured as the percentage of body fat.
- Type-A behaviour: It is characteristic of a person who is competitive, impatient and angry.
- Obesity: It is represented as the Body Mass Index (BMI), which is calculated by dividing the weight of the person by the square of their height.
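The obesity definition above corresponds to the standard BMI formula. The units shown (kilograms and metres) are the conventional ones and are an assumption here, since the essay does not state them:

```latex
\mathrm{BMI} = \frac{\text{weight (kg)}}{\bigl(\text{height (m)}\bigr)^{2}}
```

For example, a hypothetical person weighing 80 kg with a height of 1.75 m has a BMI of 80 / 1.75² ≈ 26.1, which lies inside the obesity range [14.7, 46.58] listed in Table 1.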
B. Data Pre-processing.
The South African Heart Disease dataset obtained from KEEL (KEEL (Knowledge Extraction based on Evolutionary Learning), 2004-2018) is available in .dat file format. However, WEKA requires the file to be in either .csv or .arff format. We therefore first convert the data from .dat to .csv. This is done as follows:
- Start ‘Microsoft Excel’
- File => Open => Browse => Select “All Files”
- Select and open the downloaded saheart.dat file.
- Select “delimited” => Start Import at line 15 => Next
- Unselect “Tab”
- Select “Comma”; bars should now separate the data fields => Next => Finish.
- Insert column headings.
- Save the Excel file in .csv format.
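The manual Excel steps above can also be scripted. The following is a minimal sketch, not the procedure used in this essay. It assumes that the KEEL header lines start with `@` and that the data rows are comma-separated; the column order is taken from the attribute list in this essay and should be checked against the file. The function name `keel_to_csv` and the sample rows are invented for illustration.

```python
import csv
import io

# Column names taken from the attribute list in this essay; the order is an
# assumption and should be verified against the @attribute lines of the file.
COLUMNS = ["sbp", "tobacco", "ldl", "adiposity", "famhist",
           "typea", "obesity", "alcohol", "age", "chd"]


def keel_to_csv(dat_text, columns=COLUMNS):
    """Convert KEEL-style .dat text to CSV text.

    Assumes header lines start with '@' and data rows are comma-separated.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)  # equivalent of "Insert column headings"
    for line in dat_text.splitlines():
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # skip blank lines and KEEL header lines
        writer.writerow([field.strip() for field in line.split(",")])
    return out.getvalue()


# Invented sample in the assumed layout (not real patient records).
sample = """@relation saheart
@attribute Sbp integer [101, 218]
@data
130, 1.5, 4.2, 25.0, Present, 50, 26.1, 10.0, 45, 1
120, 0.0, 3.1, 20.5, Absent, 60, 22.4, 0.0, 30, 0
"""
csv_text = keel_to_csv(sample)
print(csv_text)
```

Filtering on the `@` prefix avoids hard-coding the number of header lines, which the Excel procedure handles by starting the import at line 15.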
In addition, to avoid confusion during analysis, we used the IF function in Microsoft Excel to convert the data type of the class attribute (CHD) from Integer to Factor. All values of ‘1’ were replaced by ‘Yes’ and all values of ‘0’ were replaced by ‘No’. The updated attribute description is as follows:
Attribute | Domain | Data Type | Missing values?
sbp | [101,218] | Integer | No
tobacco | [0.0,31.2] | Number | No
ldl | [0.98,15.33] | Number | No
adiposity | [6.74,42.49] | Number | No
famhist | {Present, Absent} | Factor | No
typea | [13,78] | Integer | No
obesity | [14.7,46.58] | Number | No
alcohol | [0.0,147.19] | Number | No
age | [15,64] | Integer | No
CHD | {No, Yes} | Factor | No
Table 2. Attributes Description (Updated)
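The recoding of the class attribute described above (1 becomes Yes, 0 becomes No) can be sketched in code as well. This is an illustrative alternative to the Excel approach; the function name `recode_chd` and the sample rows are invented, and the label is assumed to be the last column, as in the tables above.

```python
# Mapping stated in the essay: 1 means CHD present, 0 means CHD absent.
LABELS = {"1": "Yes", "0": "No"}


def recode_chd(rows, label_index=-1):
    """Return copies of the rows with the CHD column recoded to Yes/No.

    `rows` is a list of lists of strings; the label is assumed to sit in
    the last column unless `label_index` says otherwise.
    """
    recoded = []
    for row in rows:
        new_row = list(row)  # copy so the input rows are left untouched
        new_row[label_index] = LABELS[new_row[label_index].strip()]
        recoded.append(new_row)
    return recoded


# Invented rows for illustration only.
rows = [["130", "Present", "1"], ["120", "Absent", "0"]]
recoded = recode_chd(rows)
print(recoded)
```

A lookup table raises a `KeyError` on any unexpected label value, which makes data-entry problems visible instead of silently passing them through.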
C. Experimental Setting
1) Environment / Tool used
We use the open-source software WEKA (Version 3.8.3) to implement our data mining algorithms. WEKA, which stands for Waikato Environment for Knowledge Analysis, is a program developed at the University of Waikato, New Zealand, to generate insights from raw data (Hazra, 2017). WEKA supports a range of data mining tasks, including pre-processing, classification, regression, feature selection, clustering, association and visualization. It is easy to use and user-friendly.
2) Estimation Methodology
The estimation methodology used here is K-fold cross-validation. The entire dataset is split into K folds (in this essay, K = 10). In each iteration, K-1 folds (i.e. 9 folds) are used as the training set and the remaining fold is held out as the test set. This process is repeated K times (i.e. 10 times), so that each fold serves as the test set exactly once, and the error rate is noted for each iteration. The final prediction error of the model is the average of the K individual test error rates (Mathematical Sciences – Chalmers University of Technology and University of Gothenburg, 2009).
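The procedure just described can be sketched in a few lines. This is a plain, unstratified illustration of the idea and is not WEKA's implementation. The function names and the `error_for_fold` callback are hypothetical: the callback stands in for training a classifier on the training indices and returning its error rate on the test indices.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Split range(n) into k shuffled folds whose sizes differ by at most 1."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(n, k, error_for_fold, seed=0):
    """Average the test error over k folds.

    `error_for_fold(train_idx, test_idx)` is a hypothetical callback that
    trains a model on train_idx and returns its error rate on test_idx.
    """
    folds = k_fold_indices(n, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # The training set is every fold except the held-out one.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        errors.append(error_for_fold(train_idx, test_idx))
    return sum(errors) / k


# With 462 instances and K = 10, two folds hold 47 instances and eight hold 46.
folds = k_fold_indices(462, k=10)
print(sorted(len(fold) for fold in folds))
```

Because every instance appears in exactly one test fold, each observation contributes to the error estimate once and is never tested by a model that saw it during training.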
3) Algorithms used
The machine learning technique employed for the prediction of CHD is the classification technique, a kind of supervised learning. The algorithms used here are:
- Decision Tree J48
- Naïve Bayes Algorithm
- Support Vector Machine (SVM)
D. Confusion Matrix and Evaluation Measures
1) Confusion Matrix
The performance of each classification model is measured with the help of its confusion matrix. A confusion matrix is a contingency table that displays the number of instances assigned to each class, cross-tabulated against their actual class, thus allowing us to calculate the classification accuracy. In general there can be two or more classes; our study has only two, giving a 2×2 confusion matrix for each classification model.
For the experiment under consideration,
Class a = Yes (has CHD)
Class b = No (no CHD)
Actual Class | Predicted Class: a (has CHD) | Predicted Class: b (no CHD)
a (has CHD) | TP | FN
b (no CHD) | FP | TN
Fig. 2 Layout of 2×2 Confusion Matrix.
The terms TP, FP, FN and TN used in the confusion matrix are defined as follows.
- True Positive (TP): Number of patients that are predicted to have CHD and do actually have CHD.
- False Positive (FP): Number of patients that are predicted to have CHD and do not actually have CHD.
- False Negative (FN): Number of patients that are predicted to not have CHD but do actually have CHD.
- True Negative (TN): Number of patients that are predicted to not have CHD and do not actually have CHD.
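The four definitions above translate directly into a counting routine. The following is a small sketch with invented labels; the function name `confusion_counts` is hypothetical, and "Yes" is treated as the positive (has CHD) class, matching the layout in Fig. 2.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FP, FN and TN, treating `positive` as the 'has CHD' class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted CHD, has CHD
        elif p == positive:
            fp += 1  # predicted CHD, does not have CHD
        elif a == positive:
            fn += 1  # predicted no CHD, has CHD
        else:
            tn += 1  # predicted no CHD, does not have CHD
    return tp, fp, fn, tn


# Invented labels for illustration only.
actual = ["Yes", "Yes", "No", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "Yes", "No", "Yes"]
counts = confusion_counts(actual, predicted)
print(counts)  # (2, 1, 1, 2)
```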
Fig. 3, Fig. 4 and Fig. 5 show the confusion matrices for Decision Tree J48, Naïve Bayes Algorithm and Support Vector Machine (SVM) respectively.
Fig. 3 Confusion Matrix for J48.
Fig. 4 Confusion matrix for Naïve Bayes Algorithm
Fig. 5 Confusion matrix for Support Vector Machine (SVM)
2) Evaluation measures
The most common measure for comparing performance is Accuracy. It is the number of correct predictions divided by the total number of predictions made by the model. However, accuracy may not be reliable enough in this case, as the two output classes contain unequal numbers of observations.
Another performance measure that goes hand in hand with accuracy is Error Rate. It is the number of incorrect predictions made by the model divided by the total number of predictions, after training the classifier on a given dataset. It is therefore equal to one minus the accuracy.
The next measure of performance is Precision. It measures how precise the model’s positive predictions are, i.e. out of all the instances predicted as positive, how many are actually positive. (Shung, 2018)
The last measure of performance used in this essay is Recall. It is the number of instances correctly predicted as positive by the classifier out of the total number of instances that are actually positive. (Shung, 2018) Recall is also known as Sensitivity.
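Expressed in terms of the confusion matrix counts, the four measures above can be computed as in the following sketch. The function name `evaluation_measures` and the example counts are invented for illustration; the zero-denominator guards are a design choice of this sketch and are not something the essay specifies.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Compute accuracy, error rate, precision and recall from the counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,       # correct / all predictions
        "error_rate": (fp + fn) / total,     # incorrect / all predictions
        # Guard against division by zero when nothing is predicted positive
        # or when no positive instances exist (a choice made for this sketch).
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented counts for illustration only.
measures = evaluation_measures(tp=30, fp=10, fn=20, tn=40)
print(measures)
```

These are the measures for the positive (has CHD) class only. The essay does not state whether its reported precision and recall figures are per-class values or averages across both classes.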
E. Results and analysis
The results for the three classification algorithms, i.e. Decision Tree J48, Naïve Bayes Algorithm and Support Vector Machine (SVM), are summarized in Table 3 below.
Measure | J48 | Naïve Bayes | SVM
Accuracy | 0.707792 | 0.716450 | 0.709957
Error Rate | 0.292208 | 0.283550 | 0.290043
Precision | 0.698 | 0.722 | 0.700
Recall | 0.708 | 0.716 | 0.710
Table 3 Comparison of Performance Measure.
It can be seen from the above results that the Naïve Bayes Algorithm outperforms the other two classification models on all four measures, although the margins are small. While all three models show an accuracy of more than 70%, accuracy alone cannot be taken as the performance measure for this study, because the two classes of the label are unequal in size. Out of the 462 instances, only 160 patients have Coronary Heart Disease (CHD), whereas the remaining 302 do not. This imbalance can subtly inflate the accuracy rate: a model can predict the majority class for most instances and still achieve a high overall accuracy, masking the mispredictions in the minority class. To keep this imbalance from distorting our performance measurement, we do not rely solely on error rate and accuracy.
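The effect of the class imbalance can be made concrete with a little arithmetic on the class counts given above. The sketch below considers a hypothetical classifier that predicts "No" for every patient; it is a baseline for comparison and not one of the models evaluated in this essay.

```python
# Class counts reported in the essay.
n_chd, n_no_chd = 160, 302
total = n_chd + n_no_chd

# A hypothetical classifier that predicts "No" for every patient.
baseline_accuracy = n_no_chd / total  # every "No" is right, every "Yes" is wrong
baseline_recall_chd = 0 / n_chd       # no CHD patient is ever detected

print(f"baseline accuracy: {baseline_accuracy:.4f}")        # 0.6537
print(f"baseline recall for CHD: {baseline_recall_chd:.4f}")  # 0.0000
```

Such a classifier reaches about 65.4% accuracy without detecting a single CHD case. The accuracies in Table 3 are only about five to six percentage points above this baseline, which is why precision and recall are needed alongside accuracy.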
Fig. 6 Performance comparison for generated models.
In such situations, we consider Precision and Recall in addition to Error Rate and Accuracy. Fig. 6 shows that there is only a very slight difference between the performance of the J48 and SVM models. Displaying the best precision and recall among the three models under study, the Naïve Bayes Algorithm proves to be the most effective algorithm for predicting CHD on the South African Heart Disease dataset.
V. CONCLUSION
A number of data mining techniques and machine learning algorithms are available to facilitate the prediction of Coronary Heart Disease, and many research approaches have been taken to the prediction and screening of diseases using medical data analysis. This research essay was an attempt to highlight a few of these prediction techniques and the performance measures associated with them. While Support Vector Machine and Decision Tree J48 performed well on their own, the Naïve Bayes Algorithm turned out to be the best classifier model. That said, future research can check whether applying unsupervised learning techniques before classification further improves the model’s prediction performance.
VI. REFERENCES
- Apte, C. S. (2012). Improve study of Heart Disease prediction system using Data Mining Classification techniques. International journal of computer application.
- Hazra, A. &. (2017). Heart Disease Diagnosis and Prediction Using Machine Learning and Data Mining Techniques: A Review. Advances in Computational Sciences and Technology., pp. 2137-2159.
- Heon Gyu Lee, K. y. (2007). Mining Biosignal Data: Coronary Artery Disease diagnosis Using Linear and Nonlinear Features of HRV. Springer-Verlag Berlin Heidelberg.
- I.S.Jenzi, P. D. (2013). A Reliable Classifier Model Using Data Mining Approach for Heart Disease Prediction. International Journal of Advanced Research in Computer Science and Software Engineering.
- KEEL (Knowledge Extraction based on Evolutionary Learning) . (2004-2018). South African Hearth data set . Retrieved from KEEL (Knowledge Extraction based on Evolutionary Learning) : https://sci2s.ugr.es/keel/dataset.php?cod=184.
- Mathematical Sciences – Chalmers University of Technology and University of Gothenburg. (2009). Regression, Model Selection, and Classification. Retrieved from Chalmers University of Technology and University of Gothenburg: http://www.math.chalmers.se/Stat/Grundutb/GU/MSG500/A09/RegSummary09.pdf
- Peter, T. K. (2012). An empirical study on prediction of heart disease using classification data mining techniques. International Conference on Advances in Engineering, Science and Management (ICAESM).
- SAS Institute. (2018). Machine Learning What it is and why it matters. Retrieved from SAS: https://www.sas.com/en_nz/insights/analytics/machine-learning.html
- Sharma, H., & Rizvi, M. A. (August 2017). Prediction of Heart Disease using Machine Learning Algorithms: A Survey. International Journal on Recent and Innovation Trends in Computing and Communication, 99-104.
- Shung, K. P. (2018, March 15). Accuracy, Precision, Recall or F1? Retrieved from Towards Data Science: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
- Sivakumar, S. (n.d.). Prediction of Coronary Heart Disease by learning from retrospective study. Retrieved from GitHub: http://srisai85.github.io/CHD/heart_attack.html
- Southern Cross. (2018, April). Coronary heart disease – causes, symptoms, prevention. Retrieved from Southern Cross: https://www.southerncross.co.nz/group/medical-library/coronary-heart-disease-causes-symptoms-prevention
- Yanwei X, W. J. (2007). Combination data mining. Proceedings International Conference on Convergence Information Technology, (pp. 868-872).