PhD Transfer Report Data Mining Health And Social Care Essay



Cardiovascular disease (CVD) is the most common cause of premature death, and its burden is expected to increase. Early identification and treatment management can prevent the development of CVD and thus improve patient outcomes. Stratifying CVD risk would therefore help to provide better health services and prevention measures. Today, vast amounts of medical data are kept in electronic records. These datasets should be fully studied and exploited to improve CVD prevention management. Data mining (DM) is the process of extracting useful knowledge from large datasets by analysing the patterns and relationships among their attributes. It has been applied in many fields, including medicine. The aim of the research is to study the use of data mining/machine learning (DM-ML) techniques in extracting knowledge of CVD from electronic health records (EHRs), to enable a better understanding of the relationship between CVD outcomes and risk factors. Using a UK dataset and a Malaysian dataset, the study will explore different DM-ML techniques with different sets of risk factor attributes as inputs to build prediction models, and it will use a feature selection algorithm to evaluate the quality of attributes in order to rank the importance of the risk factors. A 10-fold cross validation method will be used to evaluate and compare the performance of the different DM-ML classification techniques across the different sets of risk factor attributes and the different datasets. The performance of the new models will then be compared with existing prediction scores, i.e. QRISK2 and TIMI, to evaluate the use of DM-ML for developing medical prediction models.

(255 words)



CVD Cardiovascular Disease

DM Data Mining

ML Machine Learning

DM-ML Data Mining/Machine Learning

EHR Electronic Health Record

SBP Systolic Blood Pressure

NB Naïve Bayes

KNN K-Nearest Neighbour

ANN Artificial Neural Network

FNN Fuzzy Neural Network


This chapter establishes the context of the research, which lies in the cardiovascular disease (CVD) domain. It first gives general information about the disease, its background and the problems it poses, and then focuses on the practice of prediction modelling in assessing CVD risk as part of prevention strategies. Finally, it shows how this context has motivated the research.

Background: Cardiovascular Disease (CVD)

CVD (refer to Appendix 2 on Cardiovascular Disease (CVD) and Cardiovascular Risk Factors), a term used to represent a set of diseases related to the heart (cardio) and blood vessels (vascular), was responsible for the deaths of 17.3 million people in 2008, and this figure is predicted to increase to 23.6 million by 2030 (WHO 2011b). The most prevalent type of CVD, accounting for a significant proportion of these deaths, is atherosclerotic disease, i.e. stroke, heart attack and other cardiovascular heart diseases. Statistics show that this high death rate is widespread not only among developing countries but also among developed countries such as the US, UK and Australia. Although death rates have improved, CVD still causes the most deaths in these countries (Heart-UK n.d; RACGP 2006).

Consequently, this high rate has become a major economic burden. Managing CVD places growing demands on treatment, care and support services and requires more practice guidelines. Countries with high CVD rates normally have to bear the loss of productive economic resources due to inability to work and premature death. It has been estimated that CVD accounts for about 15% of healthcare delivery costs, making it the single most expensive category (Waring and Cockcroft 2007). European countries alone spent approximately €169 billion a year on CVD (Leal et al. 2006). This cost includes health care costs, informal care costs and productivity losses. A study by (Abegunde et al. 2007) on the burden and economic loss due to stroke, heart disease and diabetes in 23 low- and middle-income countries estimated an economic production loss of US$84 billion between 2006 and 2015 if no effort is made to reduce the risk of CVD. This burden has led to constant efforts not only to improve cardiovascular treatment and care but also to strengthen cardiovascular prevention.

Background: Prediction Modelling for CVD Risk Assessment

A prediction model, also known as a prediction rule or risk score, uses a set of predictors to predict the presence (diagnosis) or the occurrence (prognosis) of a certain outcome (Toll et al. 2008). It helps clinicians to make more efficient decisions about patients and thus improves overall health care services. By identifying the risk of an individual patient, better treatment, care, therapy or advice can be delivered to that patient. There are also cost benefits, as reported by (Lloyd-Jones 2010; SIGN 2007).

A prediction model is the basis of a CVD risk assessment tool. For instance, it can predict the development of CVD in a patient, identify patients at high risk and predict the risk of mortality. CVD risk assessment tools such as ASSIGN (SIGN 2007), Framingham (NationalVascularDiseasePreventionAlliance 2012), QRISK (Hippisley-Cox et al. 2007), SCORE (Germano et al. 2012) and the WHO/ISH prediction chart (UNAIDS 2007) are now widely employed in clinical use and accepted as part of primary prevention procedures (refer to Appendix 2 on Risk Assessment Tools for Cardiovascular Disease). For specific CVD such as Acute Coronary Syndrome (ACS), risk tools such as the Thrombolysis in Myocardial Infarction (TIMI) score and the Global Registry of Acute Coronary Events (GRACE) score (Antman et al. 2000; Ramjane, Han and Jing 2009) are used to support more intensive treatment strategies. The outcome from an assessment tool should help clinicians to provide better advice, care and treatment to patients in managing CVD events, which is a substantial contribution to CVD prevention. Thus, a reliable prediction model is crucial to the development of medical risk assessment tools.

Prediction models have generally been derived from sample data collected in a population (identified/randomized) and constructed using statistical methods, mainly the Cox proportional hazards regression model (Cooney, Dudina and Graham 2009; Kurz et al. 2009). Data are collected prospectively, with a defined method and duration. However, the growth of electronic health records (EHRs) within healthcare organizations has shifted the methodology for deriving and validating prediction models (Cooney, Dudina and Graham 2009; Hippisley-Cox et al. 2008b). Taking advantage of the rich volume of existing health records, QRISK2 was derived from the QResearch database, collected through an EHR system, and the developed model was validated using another EHR database, i.e. the THIN database. Even though many issues have been raised about QRISK data collection, the findings concluded that, in comparison with Framingham, QRISK performs better in both discrimination and calibration (Hippisley-Cox et al. 2008a). This illustrates the enormous potential of the medical knowledge held in such records, which should not be ignored.

Research Motivation

The importance of developing reliable prediction models to support CVD primary prevention, together with the vast amount of EHR data within healthcare organizations, has motivated the research in this domain. Data mining/machine learning (DM-ML) techniques may offer an alternative way to develop prediction models for CVD.

Data mining (DM) is the process of extracting useful knowledge from large datasets by analysing the patterns and relationships among their attributes. The extracted knowledge can then be used to develop a prediction model. One approach to extracting the knowledge is to use DM-ML techniques (refer to Appendix 2 on Data Mining Techniques). Machine learning (ML) is a subfield of artificial intelligence that designs algorithms which learn from patterns in data to improve their performance. DM-ML provides not only a platform for developing prediction models but also room to explore the knowledge held in large volumes of existing data (refer to Appendix 2 on Data Mining and Machine Learning).

Many efforts have been made to use DM-ML techniques to extract knowledge from patterns when developing prediction models. DM-ML techniques have been applied, for instance, in predicting chemical compounds, understanding customers' buying patterns, classifying images of astronomical objects and developing computer games that predict the opponent's moves (Mehmed M. Kantardzic 2005). In the medical domain in particular (refer to Appendix 2 on Medical Data Mining and Medical Data Mining: Issues and Challenges), many interesting findings have been reported, such as in cancer prediction and prognosis (Cruz and Wishart 2006), in extracting information from medical images through automatic classification (Uwimana and Ruiz 2008) and in identifying colorectal cancer testing status from medical text (Denny et al. 2012).

Motivated by the advancement of DM-ML techniques and the potential of the rich medical information in EHRs, the research will explore and evaluate different DM-ML techniques for use in the CVD domain. The aim of the research is to study the use of DM-ML techniques in extracting knowledge of CVD from EHRs, to enable a better understanding of the relationship between CVD outcomes and risk factors from the patterns in the data. Modern techniques for developing reliable prediction models will be used, as opposed to traditional statistical methods. Demonstrating the robustness of prediction models built using DM-ML techniques will provide another dimension of support for accurate and timely decision making in the healthcare industry.

Report Structure

The rest of the report is structured according to the following sections:



Chapter 2: Provides a summary of the literature on medical DM, particularly on prediction modelling for CVD.

Chapter 3: Presents the proposed research by outlining the aim and objectives and the research questions, and by highlighting the contributions and challenges.

Chapter 4: Describes the overall proposed methodology for achieving the aim and objectives.

Chapter 5: Summarizes the general arguments of the proposed research.

Literature Review

This chapter summarizes the literature on current research in predictive DM for CVD. It highlights important findings on what constitutes a reliable prediction model, work exploring different DM-ML techniques and strategies for different prediction problems, and comparisons of statistical methods with DM-ML techniques.

DM-ML Prediction model

Advances in technology have brought data mining to the attention of the healthcare industry. (Toll et al. 2008) define the development of a prediction rule as "…identification of important predictors, assigning the relative weights to each predictor, estimating the rule's predictive accuracy, estimating the rule's potential for optimism using so-called validation techniques, and - if necessary - adjusting the rule for over-fitting". A reliable prediction model exploits appropriate model development techniques and strategies, has a valid performance measure, uses an appropriate number of samples and includes appropriate risk factors (Cooney, Dudina and Graham 2009; Lloyd-Jones 2010).

Many prediction models developed using DM-ML have achieved satisfactory accuracy (Green et al. 2006; Delen, Oztekin and Kong 2010; Westreich, Lessler and Funk 2010; Zhang et al. 2009; Khosla et al. 2010). DM-ML prediction models have also been embedded in decision support systems (Jyoti et al. 2011; Chi 2009). A prediction system developed using an Artificial Neural Network (ANN) by (Nallaperuma and Lokuge 2011) was accepted and suggested for implementation in the emergency room. Even though they concluded that there are still open issues in the generalization capability of the system, these problems can be reduced with a better dataset and extensive external validation.

DM-ML Prediction Model for CVD

Many DM-ML studies on CVD have used datasets from the University of California, Irvine (UCI). This may be due to the difficulty of gaining access to medical datasets. UCI maintains a repository of several non-identifiable datasets for ML researchers. However, UCI datasets are considered small, not exhaustive and rather old (Yang and Wu 2006). Despite that, they serve as a platform for fairly comparing the performance of different data mining techniques. (Rajkumar and Reena 2010) compared K-Nearest Neighbour (KNN), Naïve Bayes (NB) and Decision List techniques and found that NB had the better accuracy, 52.33%, and the faster execution time, 609 ms. Although (Shouman, Turner and Stocker June 2012) focused on KNN voting techniques, they used other researchers' results on the same UCI datasets to establish convincing findings. (Jyoti et al. 2011) reviewed various DM-ML studies that used the UCI repository for heart disease.

Although there have been numerous attempts at CVD prediction using DM-ML, most of the work used small samples with very few risk factors and few optimization strategies. Table : Sample Literatures Comparing DM-ML Techniques presents a sample of the literature comparing several DM-ML techniques used in CVD. However, only (Kurz et al. 2009) compared a DM-ML prediction model with established risk assessment models.


Study: Data Mining Techniques

(Palaniappan and Awang 2008): Naïve Bayes; Neural Network; Decision Tree
(Delen, Oztekin and Tomak 2012): Support Vector Machine; Decision Trees; Neural Network
(Dangare and Apte 2012): Neural Network; Decision Trees; Naïve Bayes
(Rajkumar and Reena 2010): Naïve Bayes; Decision List; K-Nearest Neighbour
(Babikier et al.): Decision Tree (superior when there is less noise); Artificial Neural Network (superior when there is more noise); Naïve Bayes; Support Vector Machine
(Kangwanariyakul et al. 2010): Back-propagation Neural Network; Bayesian Neural Network; Support Vector Machine
(Chen et al. 2007): Support Vector Machine; Neural Network; Bayesian Neural Network; Decision Tree
(Sitar-Taut et al. 2009): Naïve Bayes (superior for coronary disease); Decision Tree (superior for stroke); ** same accuracy result for peripheral artery disease

Table : Sample Literatures Comparing DM-ML Techniques

There have also been attempts to improve existing algorithms and to propose new DM-ML techniques. (Kahramanli and Allahverdi 2008) introduced a new DM technique combining ANN with a Fuzzy Neural Network (FNN). Evaluated on two different datasets, it achieved acceptable sensitivity and specificity of 80.3% and 87.3% for the diabetes dataset and 93% and 78.5% for the heart disease dataset. (Patil, Joshi and Toshniwal 2010) combined the k-means clustering technique and the C4.5 classification technique to develop a new prediction model named the Hybrid Prediction Model (HPM). (Zhong, Chow and He 2012) presented novel work improving current Support Vector Machine (SVM) techniques by introducing a new technique called Multilevel-SVM (MSVM), which is claimed to handle complex and large datasets more efficiently.

From the review, the most commonly used data mining techniques are ANN, NB, SVM and Decision Tree. However, less common algorithms have also been used, such as Rule Extraction for Medical Diagnosis (REMED), which is based on a symbolic algorithm (Mena et al. 2012).

DM-ML can also be used to identify the importance of attributes in relation to the outcome. (Sitar-Taut et al. 2009) ranked the importance of risk factors for coronary artery disease (CAD), stroke and peripheral artery disease (PAD), and concluded that different types of CVD have different rankings of important risk factors. (Khalilia, Chakraborty and Popescu 2011b) used the Mean Decrease Gini measure to identify the important variables associated with different diseases, while (Delen, Oztekin and Tomak 2012) employed sensitivity analysis to identify the most important variables affecting the outcome of coronary artery bypass graft (CABG) surgery.

DM-ML Improvement Strategies

The nature of medical data poses further challenges for DM. Attention is required not only to the selection of DM techniques but also to the quality of the data used for model construction, the evaluation techniques and the generalization made from the findings.

Data quality is the most common issue in medical data mining, and data quality problems should be minimized to ensure the reliability and accuracy of the model. Medical data is known to be incomplete, incorrect and complex by nature (Cios and Moore 2002; Bellazzi and Zupan 2008). Before addressing quality issues, questions about the what, where, why, how and by whom of the data should be asked and clarified. For example, only a certain range of systolic blood pressure (SBP) values is used to assess CVD: (Mena et al. 2012) excluded from the training set all SBP values >260 mmHg or <70 mmHg, which were considered wrong for evaluating CVD outcome. It is also vital to seek advice in understanding the dataset. To retrieve an appropriate dataset, (Delen, Oztekin and Tomak 2012) sought expert help in identifying patients who had undergone CABG surgery.
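A range check of this kind can be sketched as follows. This is only an illustration, assuming each patient record is a dictionary with a hypothetical `sbp` field in mmHg; the bounds are the ones quoted above from (Mena et al. 2012), and the record layout does not reflect any dataset used in this research.

```python
# Hypothetical record layout: each patient is a dict with an "sbp" field (mmHg).
SBP_MIN = 70   # lower bound quoted in the text
SBP_MAX = 260  # upper bound quoted in the text


def filter_plausible_sbp(records, low=SBP_MIN, high=SBP_MAX):
    """Drop records whose SBP lies outside [low, high].

    Records with a missing SBP (None) are kept in this sketch so that a
    separate missing-data step can deal with them.
    """
    kept = []
    for rec in records:
        sbp = rec.get("sbp")
        if sbp is None or low <= sbp <= high:
            kept.append(rec)
    return kept


sample = [
    {"id": 1, "sbp": 120},
    {"id": 2, "sbp": 300},   # above the upper bound, dropped
    {"id": 3, "sbp": 65},    # below the lower bound, dropped
    {"id": 4, "sbp": None},  # missing, kept for later handling
]
kept_ids = [rec["id"] for rec in filter_plausible_sbp(sample)]
```

Keeping records with a missing SBP is a design choice of the sketch, so that range checking and missing-data handling stay separate steps.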

Another major concern in medical DM is missing data. One simple way of handling missing data is to remove the whole record, or the whole attribute if it has many missing values. However, this reduces the size of the training set, and a removed attribute might be a significant predictor. (Delen, Oztekin and Kong 2010) decided to remove only attributes with 95% missing data. Another way to handle missing data is to impute a representative value, such as the mean or, for categorical attributes, the most common value, as implemented by (Green et al. 2006; Dangare and Apte 2012; Khosla et al. 2010). (Khosla et al. 2010) employed Linear Regression and Regularized Expectation Maximization for data imputation. (Grzymala-Busse and Hu 2001) compared nine methods of handling missing data and concluded that two were superior to the others: the C4.5 method, which is based on entropy and splitting, and exclusion of the missing attributes.
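The simplest imputation strategy mentioned above, filling with the mean or the most common value, might look like the sketch below. The function name and the use of `None` to mark a missing value are assumptions made for the example; the regression-based and expectation-maximization approaches cited above are not shown.

```python
from collections import Counter
from statistics import mean


def impute_column(values):
    """Replace None with the mean (numeric column) or the most common value (categorical)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from; leave the column unchanged
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    if numeric:
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


sbp_filled = impute_column([120, None, 140])
smoker_filled = impute_column(["yes", None, "yes", "no"])
```

Here the missing SBP becomes 130 (the mean of 120 and 140) and the missing smoking status becomes "yes" (the most common observed value).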

Predictors strongly influence the accuracy of the model, so selecting them requires detailed analysis. A strong risk factor must be distinguished from a good predictor (Grobman and Stamilio 2006). The study by (Kurz et al. 2009), using the averaged one-dependence estimator (AODE) algorithm, found that a larger number of attributes does not necessarily lead to a better outcome. However, (Dangare and Apte 2012) found otherwise: adding two predictors did improve prediction accuracy. This could be a coincidence, in that both additional predictors happened to be good for the problem. Nevertheless, too many irrelevant features may lead to over-fitting and also consume considerable processing resources and time. In data mining, feature selection is the technique most commonly used to identify relevant attributes (Liu et al. 2010). (A.Sudha, P.Gayathri and N.Jaisankar 2012) employed a subset feature selection algorithm, (Huang et al. 2004) used the ReliefF algorithm, (Anbarasi, Anupriya and Iyengar 2010) used a genetic algorithm and (Khosla et al. 2010) proposed a novel feature selection algorithm named Conservative Mean feature selection.
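To make the idea of ranking attributes concrete, the sketch below scores categorical attributes by information gain, a simple filter-style measure. It is not one of the algorithms cited above (ReliefF, genetic search or Conservative Mean), and the attribute names and toy values are purely illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(table, labels):
    """Return feature names sorted by information gain, highest first."""
    gains = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data: the outcome follows "smoker" exactly and is unrelated to "sex".
outcome = [1, 1, 0, 0]
toy_table = {"sex": ["m", "f", "m", "f"], "smoker": ["y", "y", "n", "n"]}
ranking = rank_features(toy_table, outcome)
```

In this toy table the ranking places `smoker` ahead of `sex`, because splitting on `smoker` removes all uncertainty about the outcome while splitting on `sex` removes none.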

In addition, DM can be combined with conventional statistical methods to strengthen prediction. (Tham, Heng and Chin 2003) employed ANN to predict coronary heart disease by combining a set of gene marker attributes with the typical risk factors as predictors. To identify the gene marker inputs for the ANN, the statistical methods principal component analysis (PCA) and factor analysis (FA) were employed.

Imbalanced data is another common concern in medical DM. An imbalanced class distribution can bias the results. Because of the high number of negative-class records, (Barakat, Bradley and Barakat 2010) used the k-means algorithm as a sub-sampling method to select data for training. (Khalilia, Chakraborty and Popescu 2011a) implemented a repeated sub-sampling method based on ensemble learning to handle imbalanced data. (Japkowicz 2000; Chawla 2010; Kotsiantis, Kanellopoulos and Pintelas 2006) explored different strategies and algorithms for handling imbalanced datasets.
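The basic idea behind sub-sampling can be illustrated with plain random under-sampling, as sketched below. This is a deliberately simpler technique than the k-means-based and ensemble-based methods cited above; the function name and the fixed random seed are assumptions made for the example.

```python
import random


def undersample_majority(records, labels, seed=0):
    """Randomly drop records from larger classes until every class has the same size."""
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    out_records, out_labels = [], []
    for lab, group in by_class.items():
        for rec in rng.sample(group, target):
            out_records.append(rec)
            out_labels.append(lab)
    return out_records, out_labels


# Two positive cases against eight negative cases.
toy_records = list(range(10))
toy_labels = [1, 1] + [0] * 8
balanced_records, balanced_labels = undersample_majority(toy_records, toy_labels)
```

The balanced sample keeps both positive cases and a random two of the eight negative cases. Discarding data in this way loses information, which is one reason the cited studies use more elaborate schemes.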

Discrimination and calibration are key elements in measuring the performance of prediction models in the medical domain (Cooney, Dudina and Graham 2009). Discrimination quantifies the capability of the model to distinguish between positive and negative outcomes. A common measure of discrimination is the area under the receiver operating characteristic (ROC) curve (AUC), which summarizes the trade-off between sensitivity and specificity (Siontis et al. 2012).
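The area under the ROC curve can be read as the probability that a randomly chosen patient with the event receives a higher predicted risk than a randomly chosen patient without it. The sketch below computes it directly from that definition; the function name is hypothetical, and the pairwise loop is quadratic in the number of cases, so it is meant for illustration rather than for large datasets.

```python
def auc(scores, labels):
    """Probability that a random positive case is scored above a random negative case.

    Ties count as half a win; labels are coded 1 (event) or 0 (no event).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The first call gives 1.0 because every positive case outranks every negative case; the second gives 0.75 because one of the four positive-negative pairs is ordered the wrong way.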

Calibration measures how closely the predicted outcomes agree with the actual outcomes. It is typically measured by calculating the ratio of predicted to actual outcomes. Many have acknowledged the importance of incorporating sensitivity and specificity in evaluating performance in medical DM (Delen 2009; Tu and Shin 2009; Kangwanariyakul et al. 2010; Chen et al. 2007; Huang et al. 2004).
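These measures reduce to a few counts, as the following sketch shows. It assumes binary outcomes coded 1 (event) and 0 (no event); the function names are hypothetical, and the calibration ratio is taken here as the sum of predicted risks divided by the number of observed events, which is one common reading of the predicted-to-actual ratio.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for binary labels coded 1/0.

    Assumes at least one actual positive and one actual negative case.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    return tp / (tp + fn), tn / (tn + fp)


def predicted_to_observed_ratio(predicted_risks, actual):
    """Expected number of events (sum of predicted risks) over observed events."""
    return sum(predicted_risks) / sum(actual)


sens, spec = sensitivity_specificity([1, 0, 1, 0], [1, 1, 0, 0])
ratio = predicted_to_observed_ratio([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

In the toy example sensitivity and specificity are both 0.5, and the ratio is 1.0. A ratio of 1.0 only shows agreement in the overall number of events, so it is a summary of calibration rather than a full assessment.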

In making generalizations, it is imperative for the medical community to know how the generalization is derived. DM techniques such as ANN and SVM are considered "black-box" techniques, whose internal processes are hard to interpret. (Barakat, Bradley and Barakat 2010) demonstrated the potential of extracting rules from a "black-box" model, i.e. SVM, using their earlier work on the SVM rule extraction methods SQRex-SVM and eclectic.

Statistic Method vs. Machine Learning Method

Statistical methods have long been used in the medical domain, assisting medical practitioners in analysing medical data, especially in epidemiology and prediction modelling. The most common statistical method used is Linear Regression. Several studies have shown that DM-ML performs better than statistical methods (Westreich, Lessler and Funk 2010; Delen, Oztekin and Tomak 2012; Khosla et al. 2010; Kurz et al. 2009). However, there are also reports in favour of statistical methods (Zhang et al. 2009; Sampson et al. 2011).


Overall, no single DM-ML technique performs better than all the others. Performance depends on how well the technique fits the problem and the kind of data source used for training. The problem and data source should be understood so that supporting methods and strategies can be incorporated as necessary within the mining process. Several improvement strategies have been demonstrated in the literature for different problems in medical DM. Full attention should be given to the issues and challenges of medical DM, especially in preparing the dataset and in generalizing from the outcome.

It is believed that the study will improve on previous work in three broad respects: 1) The study will use more training data together with numerous potentially good predictors, which should give better opportunities to learn the relationship between risk factors and CVD outcomes. 2) The study will employ several DM-ML classification techniques, complemented by improvement strategies suited to the problem and data source. 3) The study will use two different datasets so that comparisons can be made in every aspect related to both the DM and CVD domains.

The Proposed Research

This chapter specifies the aim and objectives of the research, followed by a list of research questions that should be answered by the end of the research. The contributions to both the DM and medical communities and the possible major challenges are also presented.

Research Aim and Objectives

The aim of the research is to study the use of DM-ML techniques in extracting knowledge of CVD from EHRs, to enable a better understanding of the relationship between CVD outcomes and risk factors from the patterns in the data.

The objectives of the study are:-

To develop prediction models for CVD using DM-ML classification techniques.

To evaluate and compare possible DM-ML techniques in the development of prediction models for CVD.

To stratify the important risk factors that contribute to outcomes of CVD.

To evaluate the performance of prediction models against established risk assessment scores.

To achieve the objectives the following research questions should be answered:-

Which ML algorithm is suitable for ranking the importance of risk factors for CVD, and how can the ranking of the risk factors be used to establish a more accurate prediction model?

Is there any impact on prediction accuracy if a different number of risk factors (attributes) is used in the prediction?

How can prescribed drugs/treatment attributes best be used in the development of prediction models?

Which DM-ML classification techniques can construct robust prediction model for the dataset?


Since the research is multi-disciplinary, it is expected to contribute to both the DM-ML and medical communities.

DM-ML Community

To present valuable insight on different DM-ML techniques combined with different DM-ML strategies in solving specific medical problem; and to improve the understanding of the overall challenges in medical data mining.

Medical Community

To illustrate an alternative way of building prediction models from existing EHR data using DM-ML techniques. This in turn will provide value-added support to medical practice when making decisions on the prognosis and diagnosis of patients.


Major challenges may arise in the following areas:-

Understanding of dataset

Medical data is complex (Beale 2005) and is not captured for research purposes. In analysing the data, it is crucial to know how the attributes relate to actual medical work. Details such as the purpose, collection process and timing of the data must be understood. Any interdependency between attributes and the principles behind CVD risk factors must also be well understood. This is crucial in selecting suitable techniques, pre-processing strategies and evaluation metrics, and in generalizing from the findings. Sufficient medical knowledge of two distinct areas, across two datasets, must be acquired, which will not be an easy task for a researcher from a computer science background.

Quality of data

It is known that medical data can contain missing, incorrect, redundant, insufficient, inconsistent and incomplete values (Cios and Moore 2002), which will affect the overall quality of the extracted knowledge (Seifert 2004). Therefore, it is anticipated that more time will be required to prepare the data for modelling.

Research Methodology

This chapter presents the methodology of the research in achieving defined objectives.

Data will be extracted from two different sources for model development. The development will be guided by a common process model, the Cross Industry Standard Process for Data Mining (CRISP-DM), as the framework. Three research objectives should be achieved through model development. The performance of the developed models will then be compared with the 'gold standard' to evaluate the use of DM-ML for developing medical prediction models. Figure : Proposed Methodology illustrates the overall research methodology.

Refer to Appendix 1, which describes the overall plan for the research.

'Gold Standard'

Two 'gold standards' have been identified, one for each dataset: QRISK2 for the IMPROVE-PC dataset and the TIMI score for the NCVD dataset. The 'gold standard' is used as a benchmark to quantify the performance of the models. Even though there are many arguments regarding the accuracy of the QRISK2 and TIMI models (Cooney, Dudina and Graham 2009; Chase et al. 2006), they are widely accepted, have been validated (Morrow et al. 2002; Hippisley-Cox et al. 2008b) and are currently applied in clinical practice.

Figure : Proposed Methodology (components shown: Data Extraction; 'Gold Standard' Data Preparation; 'Gold Standard' Score; DM Methodology; Setting Learning Goals; Understanding Business and Data Source; Data Preparation; Classification Modelling; Evaluation and Analysis; Refinement & Optimization; Model Comparison; NCVD-ML vs TIMI Score; Analysis, Discussion and Conclusion)

Data Extraction

Data will be extracted from two different sources.

Improving Prevention of Vascular Events in Primary Care (IMPROVE-PC) dataset

The data are an outcome of the IMPROVE-PC project and consist of merged data extracted from a SystemOne primary care system, the Myocardial Ischemia National Audit Project (MINAP) and Hospital Episode Statistics (HES). The records will include patients who have been diagnosed with CVD, have a Leeds postcode, are registered with selected GPs, and are registered as inpatients or outpatients in the hospitals. (Refer to Appendix 3: Sample of IMPROVE-PC dataset)

National Cardiovascular Disease Database (NCVD), Malaysia dataset

The data contain 13,591 anonymized records of patients aged 18 and above with Acute Coronary Syndrome (ACS), including ST-elevation myocardial infarction (STEMI), non-STEMI and unstable angina (UA), admitted from 2006 to 2010 to 12 cardiac centres and general hospitals in Malaysia. The data will also include information on percutaneous coronary intervention (PCI), the ACS treatment received by a patient, where applicable. Data will be retrieved in text file format through a secure file transfer protocol, with a password required to access the file. (Refer to Appendix 4: Sample of NCVD dataset)

For the datasets to be suitable for the research, they should contain at least the risk factors used to calculate the QRISK2 and TIMI scores, and the outcomes to be predicted. The availability of those attributes will be checked during the preliminary stage.

Microsoft Excel will be used to analyse the characteristics of the data; SPSS will be used if further analysis is needed. Examples of data characteristics are the sample distribution, the sample distribution by risk factor, the number of records, the number of attributes, the categorization of attribute values and the average value of an attribute.
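The same kind of summary can also be scripted. The sketch below profiles a single attribute, assuming missing values are marked as `None`; the function name and sample values are hypothetical and serve only to show the characteristics listed above (record count, missing count, category distribution and average).

```python
from collections import Counter
from statistics import mean


def profile_attribute(values):
    """Summarise one attribute: record count, missing count, and mean or distribution."""
    observed = [v for v in values if v is not None]
    summary = {"records": len(values), "missing": len(values) - len(observed)}
    numeric = bool(observed) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    if numeric:
        summary["mean"] = mean(observed)  # average value of a numeric attribute
    else:
        summary["distribution"] = dict(Counter(observed))  # counts per category
    return summary


age_profile = profile_attribute([60, 70, None])
sex_profile = profile_attribute(["m", "f", "m"])
```

For the toy values, the age attribute has three records, one missing value and a mean of 65, while the sex attribute has a distribution of two "m" and one "f".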

Model Development

Modelling Tools

Weka, an open source tool for data mining, will be used as the modelling tool. Weka provides an environment and tools suitable for the research objectives. It offers a range of state-of-the-art algorithms to choose from, facilities to manipulate and parameterize the techniques and algorithms, and an environment in which to enhance existing ML algorithms and develop new ones.

Modelling Strategies

The models will be developed using five different DM-ML techniques, with different sets of risk factor attributes as inputs, and will go through several DM-ML refinement and optimization processes. Table : Different Sets of Risk Factors Attributes describes the different sets of risk factor attributes used as modelling inputs.

Set of Attributes

Inclusion of only the risk factors used in the 'gold standard', to evaluate and compare the performance of prediction with the 'gold standard' scores:
IMPROVE-PC (QRISK2): age, ethnicity, gender, smoking status, family history, HDL cholesterol, total cholesterol, BMI, SBP, diabetes, area-based index of deprivation and antihypertensive treatment (Cooney, Dudina and Graham 2009)
NCVD (TIMI): age, history of diabetes, hypertension or angina, systolic blood pressure, heart rate, Killip class II-IV, weight, ST elevation, time to reperfusion therapy (Morrow et al. 2001)

Inclusion of all other risk factors, excluding drugs/treatment attributes, to evaluate prediction accuracy when drugs/treatment are excluded from the risk factors.

Inclusion of all risk factors together with drugs/treatment, to evaluate prediction accuracy when drugs/treatment are included as risk factors.

Table : Different Sets of Risk Factors Attributes

Modelling Methodology

CRISP-DM will be adopted to implement the modelling process systematically (Bellazzi and Zupan 2008; (NCR) et al. 2000; Wirth and Hipp 2000). The process includes the following steps: 1) Setting the learning goals 2) Understanding the business and data sources 3) Data preparation 4) Modelling 5) Evaluation 6) Deployment, as illustrated in Figure : CRISP-DM Phases (Azevedo 2008). The modelling will be executed iteratively, because it involves different modelling techniques with different refinement and optimization strategies, different sets of risk factor attributes, and different datasets.

Figure : CRISP-DM Phases (Azevedo 2008)

Setting the learning goals

Specific learning goals for the research:-


To predict development of CVD for a patient based on CVD risk factors

To rank the important risk factors that contribute to the prediction outcome


To predict mortality of a patient based on CVD risk factors

To rank the important risk factors that contribute to the prediction outcome

Understanding the business and data sources

Understanding the business and the data sources is important in preparing data for modelling and, later, in drawing generalizations or conclusions from the outcomes. It is necessary to understand how the data are captured, for what purpose, and how each attribute maps to domain practice.

Data preparation

This is the process of preparing data for model development. Data preparation, or pre-processing, is the most crucial task and may take a large amount of time compared to the other DM steps. As data quality is important in producing reliable prediction models, detailed work should be undertaken at this stage to ensure the quality of the input to the modelling process.

Attributes selection

Selecting quality attributes is crucial in producing reliable modelling outcomes (Liu et al. 2010). A feature selection method (Liu et al. 2010; Kononenko and Kukar 2007) will be executed to evaluate how strongly each attribute influences the outcome, and the result will then be validated with the domain expert before the final selection.

Any redundant attributes will be identified and dealt with.

Data will be categorized into:-


Risk factors



Data quality validation and verification

Analysis will be conducted to check for missing, incorrect and inconsistent data. Any 'defects' will be noted and a suitable intervention measure will be introduced accordingly. All intervention measures must be documented and taken into account when drawing final generalizations from the modelling outcomes.
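As an illustration of the checks described above, the following is a minimal sketch in Python. The records, the attribute names and the plausible range for systolic blood pressure are hypothetical and chosen only for demonstration; they are not taken from the IMPROVE-PC or NCVD datasets.

```python
from collections import Counter

# Hypothetical records; attribute names and values are illustrative only.
records = [
    {"age": 54, "gender": "M", "sbp": 140},
    {"age": None, "gender": "m", "sbp": 135},
    {"age": 61, "gender": "F", "sbp": 400},
]


def missing_counts(rows):
    """Count missing (None or empty string) values per attribute."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                counts[key] += 1
    return dict(counts)


def out_of_range(rows, attribute, low, high):
    """Return indices of rows whose numeric value lies outside [low, high]."""
    return [
        i
        for i, row in enumerate(rows)
        if row.get(attribute) is not None and not (low <= row[attribute] <= high)
    ]


def distinct_values(rows, attribute):
    """List distinct non-missing values, to expose inconsistent codings."""
    return sorted(
        {row[attribute] for row in rows if row.get(attribute) not in (None, "")}
    )


print(missing_counts(records))               # missing values per attribute
print(out_of_range(records, "sbp", 50, 300))  # assumed plausible SBP range
print(distinct_values(records, "gender"))     # "M" and "m" signal inconsistency
```

Each flagged record would then be noted and handled by an intervention measure agreed with the domain expert.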

Sampling distribution

The distribution of the sample data also needs to be evaluated. Imbalanced data are common in the medical domain and may produce imprecise outcomes, especially for classification and prediction tasks (Kotsiantis, Kanellopoulos and Pintelas 2006). There are two ways of handling imbalanced data: at the data level and at the algorithm level (Kotsiantis, Kanellopoulos and Pintelas 2006). If imbalanced data do arise, they will be tackled at the data level, so that the treatment is independent of any particular data mining technique and the modelling process remains consistent. All incidences of imbalance will be noted when drawing generalizations from the modelling outcomes.
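The text does not name a specific data-level method. As one possible option, the sketch below shows random oversampling, which duplicates randomly chosen minority-class records until every class matches the size of the largest class. The function name and the seed parameter are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen records of smaller classes until all
    classes have as many records as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical example: 8 negative cases and 2 positive cases.
rows = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

If such resampling is used together with cross-validation, it is usually applied to the training folds only, so that duplicated records do not also appear in the test fold.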

Data Transformation

Data will be analysed to ensure that the data types of all attributes are suitable for the modelling process. Different DM-ML techniques may require different data types.


The finalized dataset should contain 'cleaned' records whose attributes have reasonable values and appropriate data types, organized to match the selected data mining techniques.


A preliminary study will be conducted to select the 5 DM techniques used to develop the prediction model. It will evaluate how well the DM techniques respond to the same data source. The main consideration is to employ techniques that give reliable outcomes for prediction problems. As suggested by (Song et al. 2004), there are no obvious performance differences between similar learning techniques. Therefore, techniques of various types will be considered.

To rank the important risk factors, a feature selection algorithm will be employed.
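The feature selection algorithm is not specified at this stage. As a hedged illustration, the sketch below ranks categorical attributes by information gain, which is one widely used attribute-quality measure; the function names and the toy records are assumptions, not part of the research design.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on one attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, labels):
    """Return (attribute, gain) pairs sorted from most to least informative."""
    gains = {a: information_gain([r[a] for r in rows], labels) for a in rows[0]}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: smoking status separates the outcome, gender does not.
rows = [
    {"smoker": "y", "gender": "M"},
    {"smoker": "y", "gender": "F"},
    {"smoker": "n", "gender": "M"},
    {"smoker": "n", "gender": "F"},
]
labels = [1, 1, 0, 0]
print(rank_attributes(rows, labels))
```

Continuous attributes such as SBP would need to be discretized before this measure is applied, and the resulting ranking would still be validated with the domain expert.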

Evaluation and Analysis

A 10-fold cross-validation method will be used to evaluate and compare the performance of the different DM-ML classification techniques across the different sets of risk factor attributes and the different datasets. k-fold cross-validation randomly divides the dataset into k subsets (folds) of approximately equal size; in each of k rounds, k-1 folds are used as the training set and the remaining fold is used as the testing set, so that every fold serves as the testing set exactly once. The k-fold approach is said to be less biased than the single-split approach, which uses 2/3 of the dataset for training and 1/3 for testing (Delen, Cogdell and Kasap 2012). Ten folds are selected because this is common practice in DM and is reported to give a near-optimum estimate of error (Olson and Delen 2008; Ian H. Witten 2005). The optimum model will be identified by comparing the mean and standard deviation of performance across the folds.
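The procedure can be sketched in Python as follows. This is a minimal illustration, not the Weka implementation that the research will use; the majority-class classifier is a hypothetical placeholder for any of the DM-ML techniques, and fold sizes differ by at most one record when the dataset size is not divisible by k.

```python
import random
import statistics
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(rows, labels, fit, predict, k=10, seed=0):
    """Return the mean and standard deviation of accuracy over k folds."""
    scores = []
    for train, test in k_fold_indices(len(rows), k, seed):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, rows[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder classifier: always predicts the majority class of the training set.
def fit_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


rows = list(range(100))
labels = [1 if i % 4 == 0 else 0 for i in rows]
mean_acc, sd_acc = cross_validate(rows, labels, fit_majority, predict_majority)
print(mean_acc, sd_acc)
```

Using the same seed for every technique keeps the folds identical across techniques, which makes the comparison of means and standard deviations fairer.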

It is important to include statistically meaningful measures when presenting the results (Sami 2006). In comparing prediction models, 4 performance criteria will be evaluated: 1) classification accuracy 2) sensitivity 3) specificity 4) Matthews Correlation Coefficient (MCC).

Classification accuracy quantifies the proportion of all cases, positive and negative, that are classified correctly. Four components are used to measure it: True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Equation : Classification Accuracy

It is vital to view the prediction accuracy for both positive and negative cases, as each has specific implications in the medical domain. Sensitivity measures the proportion of actual positive cases that are correctly identified, while specificity measures the proportion of actual negative cases that are correctly identified.

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP} \]

Equation : Sensitivity and Specificity

To further assess accuracy for comparison purposes, the Receiver Operating Characteristic (ROC) curve and the area under the ROC curve (AUC) will be used. With ROC, discrimination can be viewed across different decision thresholds. AUC is claimed to be useful for imbalanced datasets (Ian H. Witten 2005; Baldi et al. 2000).
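To make the link between thresholds, ROC and AUC concrete, the following is a minimal sketch, assuming each model outputs a numeric risk score and that positive cases are labelled 1 and negative cases 0. AUC is computed here through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function names and example scores are illustrative.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs obtained by
    sweeping the decision threshold over each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores for four patients.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_points(scores, labels))
print(roc_auc(scores, labels))
```

The pairwise loop is quadratic in the number of records; for large datasets a sort-based computation or the modelling tool's built-in AUC would be preferable.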

Since imbalanced data are common in data mining, the Matthews Correlation Coefficient (MCC) will be employed in the evaluation (Baldi et al. 2000). MCC measures the correlation between the predicted and the observed outcomes.

\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

Equation : Matthews Correlation Coefficient

MCC returns a value ranging from -1 to +1: +1 represents perfect prediction, -1 represents total disagreement between the predicted and observed outcomes, and 0 represents a prediction no better than random.
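The four performance criteria can be computed directly from the TP, TN, FP and FN counts. The sketch below is illustrative; the function names are assumptions, and returning 0 for MCC when its denominator is zero is a common convention rather than something stated in the text.

```python
import math


def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def performance_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and MCC from the counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        # Convention: MCC is reported as 0 when any marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical result: 50 actual positives and 50 actual negatives.
print(performance_metrics(tp=40, tn=45, fp=5, fn=10))
```

In this hypothetical example the accuracy alone (0.85) hides the fact that one in five positive cases is missed, which is why sensitivity and specificity are reported separately.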

Processing time will also be measured for evaluation and comparison purposes. Speed of performance might be important in developing prediction model especially when it involves huge datasets.

Confusion matrices will be used to present the results of the performance metrics for the different sets of risk factor attributes, run with 5 different DM techniques over the two distinct datasets.


To prepare for 4.5 Model comparison, all results and findings from the model development need to be transformed into the same 'metrics' as the 'gold standard' scores.

Data Preparation: 'gold standard' scores

This is a process to prepare 'gold standard' scores for 4.5 Model comparison.

Model comparison

To evaluate the use of DM-ML for medical model development, the results of the DM-ML model development will be compared against the 'gold standard' QRISK2 and TIMI scores.

Analysis, Discussion and Conclusion

The findings from the comparison will be summarized and generalized, taking into account all other findings made throughout the modelling process. All research questions should be firmly answered by the end of this process.

Ethical Approval

Potential ethical issues may arise from the research as it involves patient data. Approval has been granted by the NCVD board to release the data for the research (Refer Appendix 5: NCVD Application Form, Date Release Application Form and Data Release Agreement). Ethical approval from the University of Leeds Health Sciences Research Committee will be applied for, and it is expected to be granted by the end of December 2012.

For the IMPROVE-PC dataset, the research is bound by the existing ethical approval (11/NE/0182), as the author is part of the IMPROVE-PC team.


This research is an attempt to integrate DM technology into the medical field, particularly the CVD domain. Answering the research questions should provide an understanding of how medical DM can be implemented in the CVD domain and of the related issues.

The research will be beneficial to both the DM-ML and the medical communities. The findings will highlight how different DM-ML modelling techniques, together with distinct refinement and optimization strategies, perform in relation to a medical problem and dataset. For DM-ML researchers, the research could serve as a platform for more advanced studies. The medical community will have the opportunity to gain a verified alternative for building prediction models for the diagnosis and prognosis of CVD. It could also be encouraged by the underlying potential and applications of EHRs.

Appendix 1: Future Plan

Research Plan

The chart below summarizes the research plan for the next two years.

Training Plan




Machine Learning Course

To acquire basic concept and principle of different ML techniques.

To improve skills on ML with specific techniques and algorithms.

Currently attending modules for master programme in University of Leeds.

At the same time, it would be beneficial to attend a few advanced courses on solving specific medical DM problems.

Computing and Data Analysis

To acquire basic computing skills to effectively analyze data and results

Organized by Coursera, a website that provides free online courses from top universities

It will be most helpful to attend a few courses on statistical analysis.

Writing and Communication related courses

To improve our academic writing and presentation skills for publications.

SDDU or any courses organized by University of Leeds

Conferences and Journal Papers

Given the opportunity, we plan to submit working papers to related conferences and journals on the following topics:

Pre-processing strategies for the dataset

The findings on the importance ranking of risk factors

The impact on prediction accuracy of additional prescribed drugs information and of different sets of risk factors

Comparison of different modelling strategies and different techniques in two distinct datasets.

Comparing the result of ML prediction model against QRISK2 and TIMI score respectively.

Besides that, we plan to join conferences related to DM-ML and to medical informatics. When relevant to the research, we also plan to attend the weekly colloquia of both the School of Computing and Health Informatics.

Specific Contribution

If permitted by NCVD, we plan to contribute a cleaned sample of the NCVD dataset to the UCI community to promote more research in medical DM. The current UCI datasets are fairly old and small in size.

We are also planning to help NCVD improve its data dictionary documentation. This would benefit other researchers who wish to undertake research using the NCVD dataset. In addition, we will propose to the NCVD committee the attributes that are important for constructing prediction models from the NCVD dataset. In this way, we hope that data capture through electronic health records in cardiac centres and hospitals in Malaysia can be improved.

Appendix 2

Cardiovascular Disease (CVD)

CVD is a common term used to represent a set of diseases related to the heart (cardio) and blood vessels (vascular). Some of the common cardiovascular diseases are heart disease, heart attack and stroke. CVD is the most prevalent disease; it caused 17.3 million deaths worldwide in 2008, and this is expected to rise to 23.6 million deaths by 2030 (WHO 2011b). About 42% of CVD deaths are due to coronary heart disease and heart attack, almost 36% are due to stroke or another form of cerebrovascular disease, approximately 6% are due to hypertensive heart disease, 2% to inflammatory heart diseases, 1% to rheumatic heart disease, and the rest to other cardiovascular diseases (WHO 2011a).

About 78% of CVD deaths are related to atherosclerosis. Atherosclerosis is the build-up of plaque, formed from fatty material such as cholesterol, within the blood vessel wall. Over time, these plaques harden and narrow the vessel lumen. Later, plaques may rupture and result in a blood clot that can obstruct blood flow.

Atherosclerosis develops over a long period, and symptoms become noticeable only when the disease is severe. However, most CVD caused by atherosclerosis can be controlled and prevented by identifying the contributing 'factors' (generally known as cardiovascular risk factors).

Cardiovascular Risk Factors

The term risk factors was first defined by Dr. William Kannel, a former director of the Framingham Heart Study. The Framingham Heart Study was initiated in 1948 to identify the factors accountable for CVD. The project was motivated by the increasing death rate caused by atherosclerosis in the US during the 1930s and 1940s. The findings from the project have become the basis for CVD prevention strategies and for the formulation of effective treatment in clinical practice. It has also prompted others to undertake more in-depth studies on the subject (Framingham Heart Study : A Project of the National Heart 2012).

Widely accepted risk factors are age, gender, family history of CVD, blood pressure, smoking status, total cholesterol, hypertension, obesity and diabetes. However, there are other risk factors used for assessing CVD [Book : Cardiovascular rIsk Factors]. The risk factors can be further categorized into modifiable and non-modifiable risk factors. Modifiable risk factors are those that can be changed through intervention to reduce the probability of developing the disease; examples are cholesterol level, obesity, type 2 diabetes and smoking status. Non-modifiable risk factors are those that cannot be changed, such as age, gender and family history. Even though they cannot be modified, they remain an important set of risks to be evaluated when assessing for CVD.

Data Mining and Machine Learning

DM is a process of extracting information (knowledge) by identifying meaningful patterns in large datasets (Han and Kamber 2001). It employs intelligent techniques and structured methods to discover and describe patterns, and evaluates them using interestingness measures. Advances in software and hardware have allowed vast amounts of data to be captured and stored, ranging from simple to complex data types. This has created a need to process the data intelligently, identify interesting patterns and transform them into useful information and knowledge. Extracting the right and meaningful information is therefore vital, especially for gaining economic advantage. As a result, DM has become a major development in the information industry. DM has been applied in many areas, such as marketing and retail, finance and banking, engineering, sports, and the medical and health industry.

For example, DM has been employed to improve current business processes or product quality, to anticipate future trends when planning strategies, to identify risks, to formulate prevention measures, and for image interpretation and pattern recognition (Han and Kamber 2001; Wang et al. 2012; Choudhary, Harding and Tiwari 2009; Delen, Cogdell and Kasap 2012; Ian H. Witten 2005).

Depending on the problem to be solved and the nature of the data source, different data mining techniques are employed. Each technique has a different way of searching for, extracting and representing patterns. Below are the common data mining tasks (Ian H. Witten 2005; Sahu, Shrma and Gondhalakar ; Sholom M. Weiss 1998):

Classification and prediction: The task of finding generalized features of known classes or concepts, which can then be used for prediction.

Concept and class description: The task of summarizing and characterizing concepts or classes. The task is normally complemented by data visualization, which describes the concepts or classes in a more concise way.

Association analysis: The task of finding the relationships between attribute values.

Cluster analysis: The task of discovering previously unknown classes or concepts by identifying their common and similar features.

Outlier analysis: The task of detecting uncommon or rare patterns in the data objects.

Machine learning is a multidisciplinary field that draws on artificial intelligence, statistics, probability, computational complexity theory, information theory, learning theory and other fields (Meyfroidt et al. 2009). Broadly, it focuses on the development of algorithms that learn from experience, aiming to improve the performance of a system over time (Mitchell. 1997). In the context of data mining, machine learning acquires its experience by allowing the computer to learn from examples, i.e. a set of data normally referred to as the training set. The outcome of the learning process is new knowledge, often known as a model. Ultimately, the model is used for classifying new data, understanding relationships or predicting trends.

Data Mining Techniques

DM techniques, also referred to as ML techniques, are commonly divided into three major categories defined by how the machine learns and searches for patterns (Kononenko and Kukar 2007; Mitchell. 1997; Lavrač 1999; Kotsiantis, Zaharakis and Pintelas 2007). They are 1) inductive learning 2) statistical learning and 3) ANN. Inductive learning, or symbolic rule learning, uses examples to learn and to derive generalization rules from the observations. Common techniques in this category are Decision Tree, Decision Rule, Association Rules and Regression Tree. Statistical learning follows typical statistical techniques based on probabilistic and mathematical functions. SVM and NB are considered among the most successful in this category; other techniques are K-NN and Linear Regression. ANN produces output by learning the complex interactions between artificial neurons, calculated in the hidden layer. Table : Common DM-ML Techniques describes the common DM-ML techniques with their advantages and disadvantages.

DM-ML Technique: Artificial Neural Network (ANN)

Description: ANN is an artificial intelligence (AI) learning technique whose concept derives from how information is processed in the human brain. ANN involves 3 components: 1) input layer 2) hidden layer 3) output layer. The learning is based on the inputs (neurons) and on how these neurons are connected to each other in complex ways. The output is the weighted sum calculated in the hidden layer based on the relationships between neurons.

Advantages: Good predictive performance; tends to perform better when dealing with multi-dimensional and continuous features; robust to error.

Disadvantages: High computation cost in training; complex method that is difficult for a domain expert to interpret.

DM-ML Technique: Decision Tree

Description: A decision tree identifies classes starting from the root and recursively extends the tree down to branches and leaves. A branch represents an attribute of an instance, expanded according to its possible values, which lead to class labels. The recursion continues until all data have been classified.

Advantages: Simple and easy to implement, with less computational complexity; the learning can easily be examined by a human, as it is easy to read and interpret.

Disadvantages: Prone to noise, errors and missing values; possible over-fitting, and a complex model is generated when the tree grows exponentially.

DM-ML Technique: Naïve Bayes

Description: Naïve Bayes classifies using probabilistic calculation, which can be manipulated by specifying weights. The technique is based on Bayes' theorem.

Advantages: Can easily be combined with prior knowledge; can be used to make probabilistic predictions; able to tolerate missing values.

Disadvantages: High computational cost.

DM-ML Technique: Support Vector Machine (SVM)

Description: SVM classifies by identifying the hyperplanes that separate the datasets. It is among the more recent DM-ML techniques and is based on strong mathematical foundations and statistical theory.

Advantages: Supports a large number of attributes; claimed to have high accuracy.

DM-ML Technique: K-Nearest Neighbours (k-NN)

Description: k-NN is an instance-based learning technique that classifies by identifying the closest training data in the feature space.

Advantages: Among the simplest machine learning algorithms.

Disadvantages: High computational cost; can easily be degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance.

Table : Common DM-ML Techniques
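As a concrete illustration of one technique from the table, the following is a minimal k-NN sketch in Python. It assumes numeric features, Euclidean distance and majority voting; the function names, the toy data and the min-max scaling helper are assumptions made for illustration and are not the Weka implementation.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Rescale every feature to [0, 1] so that no feature dominates the
    distance only because of its unit of measurement."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [
        tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    neighbours = sorted(
        ((math.dist(row, query), label) for row, label in zip(train_rows, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data with two classes.
train_rows = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8)]
train_labels = ["low", "low", "high", "high"]
print(knn_predict(train_rows, train_labels, (1.1, 1.0), k=3))
```

Every prediction computes the distance to all training rows, which reflects the high computational cost listed in the table, and the scaling helper addresses the sensitivity to inconsistent feature scales.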

Medical Data Mining

DM has been applied successfully in many fields, such as finance, marketing and retail. The wide availability of massive data has encouraged the development and evolution of DM applications. In comparison with other fields, the medical field is considered new to DM. According to a 2008 KDnuggets survey on DM applications, only 10.3% of data mining applications were in health care/HR and only 7.5% in medical/pharmacist (KDnuggets 2011).

Nevertheless, the growth of medical data has encouraged the use of DM in the medical field. Outcomes from various studies in medical DM show great potential for DM to improve the efficiency and reduce the cost of clinical administration, as well as to improve clinical treatment and care (Koh and Tan 2011). The prediction model developed by (Zhong, Chow and He 2012) using a new hybrid DM-ML technique has shown the potential to improve the management of cost and budget in hospital administration, while (Chazard et al. 2011) and (Bate, Lindquist and Edwards 2008) used DM to detect adverse drug events.

A major strength of DM is its ability to support decision making for prognosis and diagnosis, as well as for treatment recommendation (Pogorelc, Bosnić and Gams 2012; Delen, Walker and Kadam 2005). DM-ML techniques are also used for rule generation in the development of various decision support systems (Tenorio et al. 2011; Del Fiol and Haug 2009). In a typical medical decision support system, rules are generated from expert knowledge. With DM, however, rules are generated by the system and later validated by a domain expert, which makes system development more efficient.

DM also contributes to evidence-based medicine (Stolba and Tjoa 2006). Evidence-based medicine is a new medical practice that uses clinical results as evidence for prognosis, diagnosis and clinical decisions on treatment. The knowledge extracted from large, complex healthcare datasets is important evidence that should not be neglected. (Delen, Oztekin and Kong 2010) presented the use of DM to identify more complex predictive factors for predicting survival time after transplantation. The capability of DM to handle huge datasets with tolerable processing time has encouraged the continued use of data mining in medical research (Delen 2009; Sampson et al. 2011).

Though significant work has demonstrated the potential of DM in the medical field, there is also work, such as (Sami 2006), that was rejected by the medical community. Medical DM is considered unique because it inherits the complexity of medical data and of the domain. Considerable studies have been initiated to look at the uniqueness of medical DM, such as (Kaur and Wasan 2006), (Lavrač 1999) and (Harper 2005).

Researchers must also be well aware of the issues and challenges in mining medical data. A clear DM goal must be specified, and the selected DM-ML technique must be appropriate both for the problem and for the data source (Shillabeer and Roddick 2007).

Medical Data Mining: Issues and Challenges

Medical DM poses various challenges and issues (Shillabeer and Roddick 2007; Cios and Moore 2002; Sami 2006; Iavindrasana et al. 2009). The main issues and challenges facing medical DM can be categorized into privacy, ethics and confidentiality issues; complexity of medical data; quality of medical data; and uniqueness of the medical approach.

Privacy, ethics and confidentiality issues

The main subject of medical data is a human being, i.e. the patient. To protect medical data from misuse and abuse, extensive ethical and legal rules have been put in place. These rules create limitations and difficulties in mining medical data (Khalilia, Chakraborty and Popescu 2011b). Lengthy procedures and approvals are needed to obtain medical data. Data might also be stored in various database systems that are owned and managed by different parties, making the process of obtaining data more challenging.

To secure the privacy and confidentiality of patients' data, some data may need to be anonymized, i.e. processed so that the data can no longer be linked to the identity of the patient (Cios and Moore 2002; Meystre et al. 2010). This anonymization process can be tedious and complex due to the volume of data. Once the dataset is available for mining, other security measures, such as encryption procedures, a data access rights policy, and a backup and recovery plan, need to be in place. Together, these requirements can be among the biggest challenges in medical data mining.

Complexity of medical data

According to (Fayyad, PiatetskyShapiro and Smyth 1996), challenges may arise from the nature of the data and the granularity of the knowledge to be extracted. The complexity of medical data mainly originates from the biological and social complexity of a patient (Beale 2005). On top of that, medical data grow extremely fast and in large volumes. A patient admitted to ICU may have 50 or more parameters collected per hour.

Heterogeneity also contributes to the complexity of medical data. Medical data come from different sources, are of different kinds and originate from different systems. Data may originate from doctors, clinicians or even health administrators (Hayrinen, Saranto and Nykanen 2008). For text values especially, different persons may describe the same thing differently, resulting in different interpretations. Medical data are also captured for different purposes: inputs for diagnosis, prognosis and treatment each have their own purpose and meaning in a medical dataset (Hayrinen, Saranto and Nykanen 2008).

Different types of data are also captured in medical databases, ranging from numerical values to images, sounds and unstructured free text (Cios and Moore 2002). Sounds and free text values can easily be ambiguous, inconsistent and vague. The lack of a standard, formal structure in medical data poses further challenges in medical DM. Because of this uniqueness of medical data, suitable DM techniques and methods should be applied (Shillabeer and Roddick 2007).

Quality of medical data

Quality of data is evaluated based on its correctness and completeness (Hogan and Wagner 1997). Numerous missing values exist in medical databases (Cios and Moore 2002). Medical data are prone to missing, incorrect, redundant, insufficient, inconsistent and incomplete values. This is largely due to human error resulting from inputs made by different parties, at different levels or for different purposes. It can also be due to technical errors, such as a transaction not being rolled back when the system experiences a runtime database problem. Data quality problems affect the overall quality of DM-ML results, and resolving them requires more time in the data preparation phase (Seifert 2004).

Uniqueness of medical approach

Any decision making in medicine is considered critical as it deals with human life. Because of this, rules or prediction models generated by DM-ML are not easily accepted by the medical community if there is no specific explanation of how they are derived (Shillabeer and Roddick 2007; Kononenko 2001). To many medical practitioners, any knowledge or findings extracted from a DM process must be supported and validated by clinical or scientific evidence (Shillabeer and Roddick 2007).

In measuring the accuracy of a DM outcome, it is vital to adopt the paradigm applied in the medical field (Shillabeer and Roddick 2007). For example, it is important to consider negative rates in the accuracy measurement, as not all diagnoses and treatments are imprecise. Thus, sensitivity and specificity analyses are more meaningful to the medical domain, as they consider both the positive and the negative classes (Cios and Moore 2002).

In conclusion, understanding the issues and challenges of medical DM is essential to minimize the problems that lead to poor outcomes. The unique challenges in medical DM have led researchers to explore ways and techniques of tackling these issues (Prather et al. 1997; McAullay et al. 2005; Lavrač 1999; Bellazzi and Zupan 2008). The belief is that DM, when the underlying issues of medical data mining are understood and the right techniques and tools are chosen for the medical DM problem and data source, can be an essential aid for medical practitioners and medical researchers.

Risk Assessment Tools for Cardiovascular Disease

A prediction model, also known as a prediction rule or risk score, uses a set of predictors to predict the presence (diagnosis) or the occurrence of a certain outcome (Toll et al. 2008). It is an aid for clinicians to make more informed decisions about a patient and thus improves overall health care services. By identifying the risk of a patient, better treatment and care, therapy or advice can be delivered to the patient. There are also cost benefits, as reported by (Lloyd-Jones 2010; SIGN 2007). However, prediction model or any tools derived from the model is not to replace the