# Arrhythmia Classification From ECG Signals Using Data Mining Approaches


The objective of this paper is to develop a model for ECG (electrocardiogram) classification based on Data Mining techniques. The MIT-BIH Arrhythmia database was used for the analysis of classical ECG features. The work is divided into two parts. The first part deals with the extraction and automatic analysis of the different waves of the electrocardiogram by time domain analysis. The second part concerns the extraction of decision-making support by Data Mining techniques for the detection of ECG pathologies. Two pathologies are considered: atrial fibrillation and right bundle branch block. Several decision tree classification algorithms currently in use, namely C4.5, Improved C4.5, CHAID and Improved CHAID, are compared in terms of performance. The bootstrap and cross-validation methods are used to estimate the accuracy of these classifiers. The bootstrap with pruning to 5 attributes achieves the best performance.

Index Terms- ECG, MIT-BIH, Data Mining, Decision Tree, classification rules.

Introduction

Analysis of the ECG cardiac signal is used extensively in the diagnosis of many pathologies. It consists in detecting and identifying the different waves that make up the ECG signal, measuring their durations and amplitudes, and finally establishing a diagnosis.

Atrial fibrillation is one of the most common cardiac arrhythmias and corresponds to a dysfunction of the atria. It affects 2% to 5% of people over 60 and 10% of people over 70. It results from a disorganization of the electrical activity of the atria. The analysis of the P wave is therefore very important for subjects at risk of atrial fibrillation. Right bundle branch block (RBBB), in contrast, corresponds to a deterioration of atrioventricular conduction in the right side of the heart during the chronic stage of the disease. The analysis of the PR interval and the QRS complex is very important for subjects at risk of right bundle branch block.

ECG interpretation is therefore important for cardiologists when deciding the diagnostic category of a cardiac problem. In pattern recognition problems of this kind, there is a need for reliable methods that preserve the structure of the data, that do not require strong statistical hypotheses, and that provide models that are easy to interpret. Among the techniques that best match these characteristics, Data Mining takes an important place. Data Mining is an iterative process in which progress is defined by discovery, through either automatic or manual methods. It is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an interesting outcome. Data Mining classification techniques have been used efficiently to detect firms that issue fraudulent financial statements (FFS) and to identify the factors associated with FFS [20]. The application of Data Mining approaches to physiological signal classification is also a fertile area of research, and many studies conducted to diagnose cancers have used Data Mining successfully. However, research on the application of Data Mining techniques to the detection of ECG anomalies has been rather limited.

In this study, we carry out an in-depth examination of publicly available data from the MIT-BIH Arrhythmia database [1] in order to detect some ECG abnormalities by using Data Mining classification methods. One main objective is to introduce, apply, and evaluate the use of Data Mining methods in differentiating between some clinical and pathological observations.

In this study, the Data Mining technique is tested for its applicability to the detection and classification of ECG abnormalities. We used the Decision Tree technique to identify the variables that most affect the ECG. Four algorithms are compared in terms of their predictive accuracy. The input data consist mainly of features extracted from the ECG. The sample contains data from the MIT-BIH database, which consists of ECG recordings of approximately 30 minutes each, sampled at 360 Hz.

The paper proceeds as follows: Section 2 describes the electrocardiogram signals. Section 3 reviews relevant prior research. Section 4 describes the research methodology. Section 5 presents the developed models and analyzes the results. Finally, Section 6 presents the conclusions.

Electrocardiograph Signals

The ECG waveform is divided into P, Q, R, S, T and U elements [2]. The principal waves are the P wave and the QRS complex. The P wave corresponds to atrial depolarization, which reflects the contraction of the left and right atria; its duration is between 0.06 and 0.12 seconds for a normal contraction. The QRS complex represents depolarization of the ventricles. Its duration is less than 0.1 seconds for a normal ventricular contraction. The T wave represents ventricular repolarization, which prepares the cardiac muscle for another contraction. The PR interval is the time required for an electrical impulse to be conducted from the atria to the ventricles. Its duration is normally 0.12 to 0.20 seconds and it is used to diagnose heart block problems [2]. The ST segment corresponds to the period of uniform excitation of the ventricles until the phase of ventricular recovery. It is measured from the end of the S wave (or R wave) to the beginning of the T wave.

The heart rate for a normal rhythm is between 60 and 100 bpm. ECG strips are best interpreted from lead II or lead V1, which show the rhythm of the heart most clearly (Fig. 1), according to Einthoven's Triangle [2].

Fig. 1. ECG signal with QRS complex, P-wave, T-wave, and U-wave indicated.

Prior research

Early ECG signal analysis was based on time domain methods, but these are not always sufficient to study all the features of ECG signals. A frequency representation of the signal is therefore required, and it is obtained with the FFT (Fast Fourier Transform). The unavoidable limitation of the FFT is that it fails to provide information about the exact location of frequency components in time. Over the last decade, several new algorithms have therefore been proposed, based on neural networks, genetic algorithms, the wavelet transform, and heuristic methods. In this work we used the classic method of Pan and Tompkins [3], which has the advantages of simplicity and speed of execution. The first information derived from the shape of the ECG wave is whether the heartbeat is normal or aberrant. When analyzing an ECG, the physician relies on expert knowledge for this discrimination, such as the width of the P wave, the PR time interval, the depolarization axis, etc. Full advantage can therefore be taken of expert knowledge in order to build the Mining model.

In prior research on Data Mining for data analysis, several methods have been used, in particular fuzzy and neural network algorithms [12], machine learning methods [13], statistical or pattern-recognition methods (such as k-nearest neighbors and Bayesian classifiers) [14,15], heuristic approaches [16], expert systems [17], Markov models [18], and self-organizing maps [19]. Applications include diagnostic problems in oncology, neuropsychology, and gynaecology [4]. Improved medical diagnosis and prognosis may be achieved through automatic analysis of patient data stored in medical records.

SIPINA is a Data Mining tool and a machine learning method. It is used for experiments and studies in real-world applications. It corresponds to an algorithm for the induction of decision graphs [5]. A decision graph is a generalization of a decision tree in which any two terminal nodes of the graph can be merged, and not only the leaves issued from the same node.

Methodologies

The first step in our decision rules extraction chain is the detection and extraction of features from the MIT-BIH ECG data. The output of this first stage provides six ECG features. After processing, the dataset is exported in Excel format to SIPINA. The decision tree generator is then invoked to build the decision tree for the classification analysis. The overall performance of mining association rules is determined by this second step. These rules are used for training and testing the decision-tree-based classification system. The feature extraction for the ECG Data Mining procedure is therefore organised in three steps (Fig. 2): pre-processing, processing, and decision rules extraction.

Fig. 2. Description of the extraction of decision rules (flowchart blocks: Detection of QRS; Extraction of ECG features; Processing; Decision tree; Validation; Decision rules; Signal classified).

This section focuses on the first steps, heartbeat pre-processing and processing, which are a prerequisite for ECG analysis. The methodology is described briefly.

Data collection and pre-processing

An important task in any Data Mining application is the creation of a suitable target dataset to which Data Mining can be applied. This is particularly important in ECG mining because of the complexity of the signal. The process involves pre-processing and transforming the original data into a form suitable as input to a specific Data Mining method. For this purpose we use the Matlab environment, an interactive and user-friendly system for numerical computation and graphical visualization. Loading the ECG signal into Matlab is the first step of our algorithm; it consists in converting the data from the native format of the MIT-BIH database into a format that Matlab can interpret. ECG baseline fluctuation was corrected by applying a high-pass filter with a cut-off frequency of 0.6 Hz (Fig. 3).
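As an illustration of the baseline correction step, the following sketch applies a high-pass filter with the 0.6 Hz cut-off and the 360 Hz sampling rate quoted in the text. The filter type is an assumption: the paper does not say which high-pass design was used in Matlab, so a simple first-order RC filter is shown here, written in Python rather than Matlab. The function name `highpass` and the synthetic test signal are hypothetical.

```python
import math

def highpass(signal, fs=360.0, cutoff=0.6):
    """First-order RC high-pass filter (assumed design, not taken from the paper)."""
    rc = 1.0 / (2.0 * math.pi * cutoff)   # time constant for the cut-off frequency
    dt = 1.0 / fs                         # sampling period
    alpha = rc / (rc + dt)
    out = [0.0] * len(signal)
    for n in range(1, len(signal)):
        # Only sample-to-sample changes are propagated, so a constant
        # baseline offset never reaches the output.
        out[n] = alpha * (out[n - 1] + signal[n] - signal[n - 1])
    return out

# Synthetic 10 s record: a 5 Hz oscillation riding on a constant 1.0 baseline offset.
fs = 360.0
raw = [1.0 + 0.5 * math.sin(2.0 * math.pi * 5.0 * n / fs) for n in range(3600)]
filtered = highpass(raw, fs=fs, cutoff=0.6)

# Mean over the last second: close to 1.0 before filtering, close to 0 after.
print(sum(raw[-360:]) / 360, sum(filtered[-360:]) / 360)
```

A first-order filter has a gentle roll-off; a real implementation would probably use a sharper design, but the principle of removing slow baseline drift while keeping the faster ECG content is the same.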

The ECG signal passes through a set of filters that extract information on the amplitude, duration and rhythm of the QRS complex. The first stage is a band-pass filter, whose objective is to eliminate noise outside the spectral bandwidth of the complex. The output then passes through a derivative filter, which detects abrupt variations of the signal slope and therefore the QRS complex. A squaring (quadratic) filter placed next makes all points of the signal positive and amplifies the output of the derivative filter, which strengthens the variation. A wave is detected if it exceeds a certain threshold, while all other waves, which are of no interest, remain below it. Finally, a position is assigned to the wave by choosing a fiducial point among the samples that exceed the threshold (maximum of the wave, maximum of an interpolation of the wave, midpoint between the two threshold crossings, etc.).

Fig.3. 100.dat before and after baseline elimination
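The derivative, squaring and thresholding stages described above can be sketched as follows. This is a simplified illustration and not the authors' Matlab code: the band-pass stage is omitted (the input is assumed to be already filtered), the threshold is taken as a fixed fraction of the maximum, and the parameters `threshold_fraction` and `merge_gap` are assumptions introduced for the example. The fiducial point is taken as the maximum of the signal inside each detected zone.

```python
def detect_r_peaks(signal, threshold_fraction=0.3, merge_gap=20):
    """Return the sample index of the R wave for each detected QRS zone."""
    # Derivative filter: the first difference responds to steep slopes.
    derivative = [0.0] + [signal[n] - signal[n - 1] for n in range(1, len(signal))]
    # Squaring makes every sample positive and emphasises large slopes.
    squared = [d * d for d in derivative]
    threshold = threshold_fraction * max(squared)   # assumed thresholding rule

    # Collect runs of consecutive samples above the threshold.
    zones, start = [], None
    for n, value in enumerate(squared):
        if value > threshold and start is None:
            start = n
        elif value <= threshold and start is not None:
            zones.append([start, n - 1])
            start = None
    if start is not None:
        zones.append([start, len(squared) - 1])

    # Merge runs separated by a short gap (upslope and downslope of one QRS).
    merged = []
    for zone in zones:
        if merged and zone[0] - merged[-1][1] <= merge_gap:
            merged[-1][1] = zone[1]
        else:
            merged.append(zone)

    # Fiducial point: position of the signal maximum inside each zone.
    return [max(range(s, e + 1), key=lambda n: signal[n]) for s, e in merged]

def synthetic_ecg(length=1000, peaks=(200, 500, 800), half_width=5):
    """Flat baseline with triangular 'R waves' at known positions (test signal only)."""
    signal = [0.0] * length
    for p in peaks:
        for k in range(-half_width, half_width + 1):
            signal[p + k] = 1.0 - abs(k) / half_width
    return signal

r_positions = detect_r_peaks(synthetic_ecg())
print(r_positions)
```

On real recordings a fixed fraction of the global maximum is fragile; the Pan and Tompkins method cited in the text uses moving-window integration and adaptive thresholds, which are left out here for brevity.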

Extraction feature of the ECG

Detection of position and amplitude of R wave

Applying the coherent averaging technique to the ECG signal requires locating a fiducial point to serve as a synchronisation reference. An algorithm operable in real time has been developed to assign a position to the wave. The alignment point for wave i lies between Start_zone(i) and End_zone(i) (Fig. 4). There are two possibilities. In the first, the point is the maximum of the wave, that is, the position of maximum amplitude, Pos1(i). In the second, it is the midpoint between Start_zone(i) and End_zone(i), Pos2(i), which is valid only for symmetric waves (Fig. 4). In this work the position of R_pick is the position at which the signal is maximum between Start_zone(i) and End_zone(i).

Fig. 4. Fiducial point of the wave (labels: Start_zone(i), Pos1(i), Pos2(i), End_zone(i)).
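The two candidate fiducial points can be compared on a small example. The sketch below is illustrative only; the function name `fiducial_points` and the toy wave are hypothetical, and the zone limits are treated as inclusive sample indices.

```python
def fiducial_points(signal, start_zone, end_zone):
    """Return (pos1, pos2) for one wave delimited by start_zone..end_zone (inclusive)."""
    # Pos1: sample index where the amplitude is maximal inside the zone.
    pos1 = max(range(start_zone, end_zone + 1), key=lambda n: signal[n])
    # Pos2: midpoint of the zone, meaningful only for symmetric waves.
    pos2 = (start_zone + end_zone) // 2
    return pos1, pos2

# Asymmetric toy wave: fast rise, slow decay.
wave = [0.0, 0.5, 1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
pos1, pos2 = fiducial_points(wave, 0, 7)
print(pos1, pos2)
```

For this asymmetric wave the two positions differ, which is why the maximum (Pos1) is the safer choice when the wave shape is not known to be symmetric.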

Detection of position and amplitude of P wave

To segment the P wave, the QRS complex of each cycle must first be detected; the P wave is then found by searching backwards from it [6] (Fig. 5).

Fig. 5. Flowchart of the P_pos detection (steps: Start; Pre-processing; Detection of QRS complex; R_pos detection; test R_pos(1) - search backward > 1; if No, rpos = rpos(2:length(rpos)); if Yes, set window and find maximum; output P_pos, P_pick).
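A minimal sketch of the backward search for the P wave is given below. It assumes the R positions are already known; the window limits `search_back` and `qrs_guard` (in samples) are hypothetical values chosen for the example and are not taken from the paper. Beats whose search window would start before the beginning of the record are skipped, which generalises the flowchart test on the first beat.

```python
def detect_p_waves(signal, r_positions, search_back=90, qrs_guard=20):
    """Return (p_pos, p_pick) pairs found by searching backwards from each R wave."""
    # Skip beats whose backward window would fall before the start of the record.
    usable = [r for r in r_positions if r - search_back >= 0]
    p_waves = []
    for r in usable:
        window_start = r - search_back   # farthest point searched before the R wave
        window_end = r - qrs_guard       # stop short of the QRS complex itself
        # The P wave is taken as the maximum of the signal inside the window.
        p_pos = max(range(window_start, window_end + 1), key=lambda n: signal[n])
        p_waves.append((p_pos, signal[p_pos]))
    return p_waves

# Toy record: two R waves, the first too close to the start of the record.
record = [0.0] * 600
record[50] = 1.0
record[400] = 1.0
record[339], record[340], record[341] = 0.1, 0.2, 0.1   # small P wave before the second beat
p_waves = detect_p_waves(record, [50, 400])
print(p_waves)
```

At 360 Hz, 90 samples correspond to 0.25 s and 20 samples to about 0.06 s; suitable values depend on the heart rate and on the width of the QRS complex.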

Detection of Start and End of wave

A useful feature for diagnosis is the distance between the characteristic waves, which is obtained from the positions of the beginning and the end of each wave. The start or the end of a wave can be found by thresholding the derivative. The principle of this method is that an isoelectric segment theoretically has a derivative of zero, or at least close to zero, whereas a wave contributes a significant derivative, except around its extrema [6]. The procedure therefore starts at the maximum of the wave and searches for the maximum derivative towards the start or the end of the wave; it then continues in the same direction until the absolute derivative falls below a certain fraction of that maximum (Fig. 6).

Fig.6. ECG 209.DAT with R_pick, P_pick, start and end of wave.
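The derivative-thresholding procedure can be sketched as follows. This is a simplified reading of the description above, not the authors' implementation: the derivative is approximated by a central difference, and `search_span` and `fraction` are assumed parameters. The peak is assumed to lie away from the ends of the signal.

```python
def wave_boundary(signal, peak, direction, search_span=40, fraction=0.1):
    """Locate the onset (direction=-1) or the end (direction=+1) of a wave."""
    def slope(n):
        # Absolute central-difference derivative at sample n.
        return abs(signal[n + 1] - signal[n - 1]) / 2.0

    lo, hi = 1, len(signal) - 2
    # Step 1: find the steepest slope between the peak and the search limit.
    candidates = [peak + direction * k for k in range(1, search_span + 1)]
    candidates = [n for n in candidates if lo <= n <= hi]
    steepest = max(candidates, key=slope)
    limit = fraction * slope(steepest)
    # Step 2: keep moving in the same direction until the slope becomes small.
    n = steepest
    while lo <= n + direction <= hi and slope(n) >= limit:
        n += direction
    return n

# Toy triangular wave: rises from sample 40 to a peak at 50, returns to zero at 60.
wave = [0.0] * 100
for k in range(-10, 11):
    wave[50 + k] = 1.0 - abs(k) / 10.0

onset = wave_boundary(wave, peak=50, direction=-1)
offset = wave_boundary(wave, peak=50, direction=+1)
print(onset, offset)
```

The returned indices lie within a sample or two of the true boundaries; the exact value depends on the derivative approximation and on the chosen fraction.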

Decision tree models

In decision analysis, decision trees are used to represent decisions and decision making. In Data Mining, a decision tree describes data rather than decisions; the resulting classification tree can in turn be an input for decision making. The decision tree is one of the most important methods for classification. It is built from the given data, their values and their characteristics, and both the number and the type of attribute values affect the result of the tree building procedure. A decision tree needs two kinds of data: training data and testing data. The training data form the larger part of the data, and the tree construction procedure is based on them. In general, the more training data are available, the higher the accuracy of the decisions. The testing data give the accuracy and the misclassification rate of the decision tree.

There are many decision tree algorithms. C4.5 is a supervised classification algorithm developed by Ross Quinlan to generate decision trees [7]. It is based on the ID3 algorithm, to which it brings several improvements. CART (Classification and Regression Tree) is a data exploration and prediction algorithm that is similar to C4.5 in the way it constructs the tree. CHAID (Chi-square Automatic Interaction Detector) is similar to CART but differs in the way it chooses the split node: it relies on the chi-square test used in contingency tables to determine which categorical predictor is farthest from independence with the predicted values. It also has an extended version, Improved CHAID. Decision trees may not be the best method in terms of classification accuracy, but they are able to identify important variables and to handle interactions between them. Furthermore, the tree rules are easy to interpret, very compact, and ideal for implementation in real-time control systems with embedded processors.

Improved C4.5

The C4.5 algorithm is a learning mechanism. Its attribute selection rests on the assumption that the complexity of the decision tree is closely linked to the amount of information conveyed by a given attribute. C4.5 extends the classification range to numerical attributes. The algorithm is based on the information entropy contained in the nodes produced by the decision tree. Entropy represents the degree of disorder of the objects in the system: the smaller the entropy, the smaller the disorder [8].
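To make the role of entropy concrete, the sketch below computes the information entropy of a set of beat labels and the gain ratio of a binary split on one continuous attribute, which is the criterion C4.5 uses to rank candidate splits. The attribute values are invented for the example; only the threshold of 0.02 echoes the amp_p split reported later in the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split 'value < threshold' on a continuous attribute."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    if not left or not right:
        return 0.0                       # the split separates nothing
    n = len(labels)
    weights = [len(left) / n, len(right) / n]
    # Information gain: entropy before the split minus weighted entropy after it.
    gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    # Split information penalises splits into many small branches.
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info

# Hypothetical amp_p values for four beats (not data from the paper).
amp_p = [0.010, 0.015, 0.050, 0.080]
state = ["AF", "AF", "N", "N"]
ratio = gain_ratio(amp_p, state, threshold=0.02)
print(entropy(state), ratio)
```

A pure node has zero entropy, so a split that separates the classes perfectly, as in this toy case, removes all of the disorder present before the split.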

The improved C4.5 algorithm introduces a balance coefficient into the traditional C4.5 algorithm. It can artificially raise the information entropy of some attributes in the classification decision and correspondingly lower the information entropy of other attributes, in order to improve the classification rules and the results. The decision tree constructed by the improved algorithm has a higher accuracy. The improvement widens the field of practical applications of the decision tree algorithm and has thus played an important role in promoting the research and development of Data Mining technology [9].

Improved CHAID

CHAID (Chi-Square Automatic Interaction Detection) is a type of classification tree that prunes automatically and produces sensitive, easily visualized segmentation rules [10].

The CHAID algorithm starts at the root node of the tree and divides it into child nodes until leaf nodes terminate the branching. The splits are determined using the chi-squared test. Once the decision tree is constructed, it is easy to convert it into a set of rules by deriving one rule for each path that starts at the root and ends at a leaf node. Decision rules are often represented in decision table formalism. A decision table represents an exhaustive set of expressions that link conditions to particular actions or decisions [11].

The improved CHAID is directly derived from CHAID. It brings some improvements in order to better control the size of the tree. In addition, Tschuprow's T is used instead of the standard chi-square statistic.
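The two split criteria can be compared on a small contingency table. The sketch below computes the Pearson chi-square statistic and Tschuprow's T, which normalises chi-square by the sample size and by the dimensions of the table. The table values are invented for the example, and all row and column totals are assumed to be non-zero.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if branch and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def tschuprow_t(table):
    """Tschuprow's T: sqrt(chi2 / (n * sqrt((rows - 1) * (cols - 1))))."""
    n = sum(sum(row) for row in table)
    rows, cols = len(table), len(table[0])
    return math.sqrt(chi_square(table) / (n * math.sqrt((rows - 1) * (cols - 1))))

# Hypothetical split: rows are the two branches, columns are two beat classes.
split = [[30, 10],
         [10, 30]]
print(chi_square(split), tschuprow_t(split))
```

Because T is bounded between 0 and 1 regardless of the number of beats, it allows splits computed on nodes of different sizes to be compared on the same scale, which is one way of keeping the tree size under control.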

Experiments and results analysis

After digitizing the ECG data (2034 records and 6 attributes), 1423 records are used as the training set and 611 records as the testing set. Decision trees handle both continuous and categorical variables. In our work we therefore used 6 continuous variables and one discrete variable to examine right bundle branch block (RBBB), atrial fibrillation (AF), normal (N) and other (O) beats (Table 1). This yields 6 raw features, defined with the help of doctors and of our own experience in analysis, and we used these raw data as the experimental data. The analysis methods in the experiment are C4.5, Improved C4.5, CHAID and Improved CHAID. Each algorithm is run on the raw data in order to compare the decision tree parameters and the correct classification rate.

Table 1. Description of variables

| Variables | Description | Nature |
|---|---|---|
| amp_r | Amplitude of R wave | continuous |
| amp_p | Amplitude of P wave | continuous |
| dur_r | Length of R wave | continuous |
| dur_p | Length of P wave | continuous |
| dur_rr | Length between 2 consecutive R waves | continuous |
| Seg_pr | Time backwards from QRS which defines end of the search window | continuous |
| state | State of beat | discrete |

The Decision Tree model is constructed using the Sipina Research Edition software. We used the whole training sample as a training set. Fig. 7 shows the constructed Decision Tree. The model was tested against the training sample of 1423 beats. As can be seen in Fig. 7, the algorithm uses the variable amp_p as the first splitter. For example, for a considerably low amp_p value (amp_p < 0.02) we have 375 of the 462 N beats, 9 of the 564 AF beats and 34 of the 217 RBBB beats.

The variable amp_r is used as the second-level splitter. At this stage, for low amp_r values (amp_r < 1.56) no RBBB is present, whereas for high amp_r values RBBB is detected. Table 2 lists the splitting variables in the order in which they appear in the Decision Tree.

Table 2. The splitting variables

| Variables |
|---|
| amp_p |
| amp_r |
| dur_r |
| seg_pr |
| dur_r |
| dur_p |

Fig.7. Decision tree by C4.5 classifier

All the features were extracted for this particular application. The features extracted from the training ECG data were then applied to the testing ECG signals. In essence, we have developed a system that is trained once and then applies the same technique to other signals. The performance of these features can be assessed by certain evaluation criteria. The second set of ECG data was used for comparison and validation.

The models validation

In machine learning, it is essential to be able to compare the results of different algorithms or statistical variations in order to decide which is best for a given application. In this section we analyse the different decision tree methods using two validation models, cross-validation and bootstrapping, for performance evaluation. Both are methods for estimating generalization error based on re-sampling, and they are the most commonly used methods for estimating the unknown performance of a classifier designed for discrimination. The importance of reliable performance estimation when using small data sets must not be underestimated. Sipina implements several model assessment methods, in particular cross-validation and bootstrap.

Cross validation

In k-fold cross-validation, the data are divided into k subsets of approximately equal size. The model is then trained k times, each time leaving one of the subsets out of training and using the omitted subset to compute the prediction error. The mean of these k values is the cross-validation estimate of the extra-sample error. Table 3 shows the cross-validation error rate for the different decision tree methods. Parameter selection is done by 10-fold cross-validation on the training set. The C4.5 classifier gives the lowest cross-validation error rate, 14.69 % (Table 3). The cross-validation estimate is lower than the training error rate for three of the four methods.

Table 3. Error rate

| Method | Error rate (training) | Error rate (cross-validation) |
|---|---|---|
| C4.5 | 15.55 % | 14.69 % |
| Improved C4.5 | 18.66 % | 15.88 % |
| CHAID | 14.73 % | 18.20 % |
| Improved CHAID | 29.62 % | 25.02 % |
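The k-fold procedure can be sketched in a few lines. The example below is illustrative and does not reproduce the Sipina implementation: a nearest-class-mean rule on a single feature stands in for the decision tree, the folds are formed by taking every k-th sample, and the data are invented.

```python
def fit_centroids(xs, ys):
    """Hypothetical stand-in classifier: store the mean feature value of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c) for c in set(ys)}

def predict_centroid(model, x):
    """Assign x to the class whose mean is nearest."""
    return min(model, key=lambda c: abs(x - model[c]))

def k_fold_error(xs, ys, k, fit, predict):
    """Mean misclassification rate over k held-out folds (assumes k <= len(xs))."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = list(range(fold, n, k))      # every k-th sample forms one fold
        held = set(test_idx)
        train_x = [xs[i] for i in range(n) if i not in held]
        train_y = [ys[i] for i in range(n) if i not in held]
        model = fit(train_x, train_y)           # train without the held-out fold
        wrong = sum(predict(model, xs[i]) != ys[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / k

# Invented, well-separated feature values for two classes of beats.
xs = [0.01 * i for i in range(10)] + [0.20 + 0.01 * i for i in range(10)]
ys = ["AF"] * 10 + ["N"] * 10
cv_error = k_fold_error(xs, ys, k=5, fit=fit_centroids, predict=predict_centroid)
print(cv_error)
```

Passing `fit` and `predict` as arguments keeps the resampling logic independent of the classifier, so the same routine could wrap any of the four tree methods compared in the paper.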

Bootstrap

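The bootstrap method is a general re-sampling procedure for estimating the distributions of statistics based on independent observations. It consists in generating multiple statistically equivalent data sets from a small amount of data. Instead of repeatedly analyzing subsets of the data, as in cross-validation, random sub-samples are repeatedly analyzed. Each sub-sample is a random sample drawn with replacement from the full sample. Some more sophisticated bootstrap methods can be used for estimating generalization error in classification problems, such as the well-known .632 bootstrap, which has the advantage of performing well in cases of overfitting. The estimated prediction error errBt is calculated by considering M samples {d1, d2, ..., dM} of size n, drawn with replacement from the training set; the model is estimated on each sample dj, and the error is evaluated on the observations that are not included in dj. In its standard form, the leave-one-out bootstrap estimator of err is defined as:

$$\widehat{err}_{Bt} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|C^{-i}|}\sum_{j \in C^{-i}} L\left(y_i, \hat{f}_j(x_i)\right) \qquad (1)$$

where $C^{-i}$ is the set of indices j of the samples dj that do not contain observation i, $\hat{f}_j$ is the model estimated on dj, and $L$ is the classification loss (1 for a misclassified observation, 0 otherwise).

To improve errBt, Efron [32] proposed the 0.632 bootstrap estimator:

$$\widehat{err}_{.632} = 0.368\,\overline{err} + 0.632\,\widehat{err}_{Bt} \qquad (2)$$

where $\overline{err}$ is the error rate measured on the training set.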
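The .632 bootstrap estimate can be sketched as follows: the leave-one-out bootstrap error is combined with the training error using the weights 0.632 and 0.368. As before, a nearest-class-mean rule on one feature stands in for the decision tree, and the data, the number of bootstrap samples `n_boot` and the random seed are assumptions made for the example.

```python
import random

def fit_centroids(xs, ys):
    """Hypothetical stand-in classifier: store the mean feature value of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c) for c in set(ys)}

def predict_centroid(model, x):
    """Assign x to the class whose mean is nearest."""
    return min(model, key=lambda c: abs(x - model[c]))

def bootstrap_632(xs, ys, fit, predict, n_boot=50, seed=0):
    """0.632 bootstrap estimate of the misclassification rate."""
    rng = random.Random(seed)
    n = len(xs)
    # Training (resubstitution) error of the model fitted on the full sample.
    model = fit(xs, ys)
    err_train = sum(predict(model, x) != y for x, y in zip(xs, ys)) / n

    # Leave-one-out bootstrap: an observation is scored only by models whose
    # bootstrap sample does not contain it.
    wrong = [0] * n
    used = [0] * n
    for _ in range(n_boot):
        drawn = [rng.randrange(n) for _ in range(n)]      # sample with replacement
        model = fit([xs[i] for i in drawn], [ys[i] for i in drawn])
        in_sample = set(drawn)
        for i in range(n):
            if i not in in_sample:
                used[i] += 1
                wrong[i] += predict(model, xs[i]) != ys[i]
    per_observation = [w / u for w, u in zip(wrong, used) if u > 0]
    err_loo = sum(per_observation) / len(per_observation)

    return 0.368 * err_train + 0.632 * err_loo

# Invented, well-separated feature values for two classes of beats.
xs = [0.01 * i for i in range(10)] + [0.20 + 0.01 * i for i in range(10)]
ys = ["AF"] * 10 + ["N"] * 10
estimate = bootstrap_632(xs, ys, fit_centroids, predict_centroid)
print(estimate)
```

The weight 0.632 is approximately the probability that a given observation appears in a bootstrap sample of size n, which is why the held-out error receives that share of the estimate.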

The C4.5 classifier gives the lowest bootstrap error rate, 12.48 % (Table 4). The bootstrap estimate is lower than the training error rate for all four methods.

Table 4. Error rate

| Method | Error rate (training) | Error rate (bootstrap) |
|---|---|---|
| C4.5 | 15.55 % | 12.48 % |
| Improved C4.5 | 18.66 % | 14.08 % |
| CHAID | 14.73 % | 13.74 % |
| Improved CHAID | 29.62 % | 24.77 % |

In our simulations, the estimators based on the bootstrap model clearly give the best performance. Less satisfactory results were obtained with the cross-validation method.

Pruning

The selection of variables is an essential aspect of supervised classification. We have to determine which attributes are relevant for predicting the values of the target variable. The wrapper approach is used for its ability to optimize performance criteria related to the error rate. Table 5 shows the error rate according to the number of variables for several classification methods. Bootstrapping seems to work better than cross-validation in many cases. In terms of performance, the bootstrap model achieved the best results. It provides the minimum error rate, 11.92 %, when we used 5 variables for classification by CHAID.

Table 5. Error rate (%) according to number of variables

| Method | Estimator | amp_p, amp_r | amp_p, amp_r, seg_pr | amp_p, amp_r, seg_pr, dur_r | amp_p, amp_r, seg_pr, dur_r, dur_rr | amp_p, amp_r, seg_pr, dur_r, dur_rr, dur_p |
|---|---|---|---|---|---|---|
| Improved CHAID | B | 24.63 | 24.54 | 25.01 | 24.45 | 24.65 |
| Improved CHAID | CV | 26.35 | 24.94 | 25.37 | 25.58 | 25.72 |
| CHAID | B | 20.47 | 14.83 | 12.68 | 13.01 | 12.92 |
| CHAID | CV | 23.26 | 18.34 | 17.85 | 15.81 | 17.08 |
| Improved C4.5 | B | 19.26 | 15.22 | 14.26 | 13.42 | 14.19 |
| Improved C4.5 | CV | 21.22 | 17.14 | 15.39 | 15.39 | 16.8 |
| C4.5 | B | 19.86 | 14.55 | 13.11 | 11.92 | 11.96 |
| C4.5 | CV | 21.92 | 15.17 | 15.18 | 13.07 | 14.55 |

B: bootstrap; CV: cross-validation.
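A wrapper approach evaluates candidate subsets of variables by the error rate of the classifier itself. The sketch below shows one common variant, greedy forward selection; whether the authors' wrapper proceeded forwards in this way is an assumption. The callback `estimate_error` is hypothetical and would in practice return the bootstrap or cross-validation error of a tree built on the given variables. The toy scoring function and its numbers are invented for the example and are not results from the paper.

```python
def forward_selection(features, estimate_error):
    """Greedy wrapper: add one variable at a time, keep the best subset seen."""
    selected = []
    remaining = list(features)
    history = []
    while remaining:
        # Try each remaining variable and keep the one giving the lowest error.
        best = min(remaining, key=lambda f: estimate_error(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), estimate_error(selected)))
    # Return the subset with the lowest estimated error over all sizes.
    return min(history, key=lambda item: item[1])

# Invented contribution of each variable to error reduction (illustration only).
TOY_GAIN = {"amp_p": 8.0, "amp_r": 5.0, "seg_pr": 3.0,
            "dur_r": 1.5, "dur_rr": 0.5, "dur_p": -0.2}

def toy_error(subset):
    """Hypothetical error estimator: starts at 30 % and drops with each useful variable."""
    return 30.0 - sum(TOY_GAIN[f] for f in subset)

best_subset, best_error = forward_selection(list(TOY_GAIN), toy_error)
print(best_subset, best_error)
```

In this toy setting the last variable slightly increases the error, so the selected subset stops at five variables; with a real estimator each call to `estimate_error` would involve retraining the tree, which is the main cost of the wrapper approach.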

Figure 8 shows the performance of the classifier models as the number of attributes is increased. Separate training data were gathered for each number of variables, from 1 to 6. As the number of variables increases, the accuracy of most classifiers improves. The overhead of computing the pruning predictors remains constant, but the extra accuracy is no longer cost effective beyond 5 variables, where Table 5 shows little or no further gain.

Fig.8. Error rate evolution with pruning

Conclusion

The present study is mainly aimed at using time domain analysis for the estimation of clinically significant parameters of ECG waveforms. The method detects some ECG abnormalities by using Data Mining classification methods. One main objective is to introduce, apply, and evaluate the use of Data Mining methods in differentiating between atrial fibrillation, right bundle branch block and normal signals. In this paper, several methods have been used to build an arrhythmia classification system from ECG signals, specifically the CHAID, Improved CHAID, Improved C4.5 and C4.5 decision trees. Classifier performance is evaluated using the bootstrap and cross-validation models for error rate estimation. The combination of the CHAID classifier and the bootstrap with pruning to 5 attributes achieves the best performance during the validation of our model.