A Method of Improving Medical Data Mining


EXPERIMENT DESIGN

Test Set-Up

The tests are carried out on a system with an Intel i5 processor, 8 GB DDR3 RAM, and a 500 GB hard drive, running the Windows XP operating system. The proposed algorithm is implemented in MATLAB R2009a. The stepwise approach is as follows. The input to the system is given in Comma Separated Value (CSV) format as two files: one with the attributes alone and the other with the class labels. The proposed algorithm is executed, and the features are obtained as output in ranked order. The classifiers are then tested with the selected attributes supplied in Attribute-Relation File Format (ARFF). A table is created in Oracle using the name specified in "@relation"; the attributes specified under "@attribute" and the instances specified under "@data" are retrieved from the ARFF file and added to the created table. 10-fold cross-validation is performed for all the classifiers. The number of runs equals the number of features in the dataset, and the results of each classification algorithm combined with Improved Normalized Pointwise Mutual Information (NB-INPMI, SVM-INPMI, J48-INPMI) were recorded. In each run, the dataset was split randomly into a training and a testing set.
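As a concrete illustration, the following minimal Java sketch shows how such a 10-fold cross-validation run can be driven through the WEKA API. This is a sketch, not the exact experimental code: the ARFF file name, the random seed, and the choice of classifier are illustrative.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class CrossValidationHarness {
        public static void main(String[] args) throws Exception {
            // Load the ARFF file built from the ranked-feature output
            // ("dermatology.arff" is an illustrative name).
            Instances data = new Instances(
                    new BufferedReader(new FileReader("dermatology.arff")));
            data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute

            // 10-fold cross-validation; folds are split randomly, as in the set-up above.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.printf("Accuracy: %.4f%%%n", eval.pctCorrect());
        }
    }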

Dataset Used - Erythemato-Squamous Disease

The differential diagnosis of erythemato-squamous diseases is a difficult problem in dermatology. The erythemato-squamous diseases [20] all share the clinical features of erythema and scaling, with very few differences. The diseases in this family are chronic dermatitis, psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, and pityriasis rubra pilaris. The dataset used in our study covers 34 features and the six classes of erythemato-squamous disease. It collects 366 samples, each with 34 attributes: 12 clinical features (e.g., age and family history), 22 histopathological features obtained after a biopsy of a patient's skin sample, and one class attribute. In the dataset, the family history feature takes the value 1 or 0 depending on whether the disease has been observed in the family. The other clinical and histopathological features take a degree in the range 0-3: 0 indicates the absence of the feature, 3 represents the largest possible amount, and 1 and 2 denote intermediate values. A detailed description of the erythemato-squamous disease dataset is given in Table 1.

Table 1. Detailed Description of Erythemato-squamous disease

Patient records of erythemato-squamous disease (class and count):
Psoriasis (111), Seboreic dermatitis (60), Lichen planus (71), Pityriasis rosea (48), Chronic dermatitis (48), Pityriasis rubra pilaris (20)

Clinical features:
1: Erythema
2: Scaling
3: Definite borders
4: Itching
5: Koebner phenomenon
6: Polygonal papules
7: Follicular papules
8: Oral mucosal involvement
9: Knee and elbow involvement
10: Scalp involvement
11: Family history (0 or 1)
34: Age

Histopathological features:
12: Melanin incontinence
13: Eosinophils in the infiltrate
14: PNL infiltrate
15: Fibrosis of the papillary dermis
16: Exocytosis
17: Acanthosis
18: Hyperkeratosis
19: Parakeratosis
20: Clubbing of the rete ridges
21: Elongation of the rete ridges
22: Thinning of the suprapapillary epidermis
23: Spongiform pustule
24: Munro microabcess
25: Focal hypergranulosis
26: Disappearance of the granular layer
27: Vacuolisation and damage of the basal layer
28: Spongiosis
29: Saw-tooth appearance of retes
30: Follicular horn plug
31: Perifollicular parakeratosis
32: Inflammatory mononuclear infiltrate
33: Band-like infiltrate

Setting Parameters For Accuracy Estimators

Parameter Setting For NB Model

Naive Bayes (NB) is one of the oldest classifiers [9]. It is obtained by applying the Bayes rule under the assumption that features are independent of each other given the class. It is one of the simplest classifiers, yet it can often outperform more sophisticated classification methods, and it handles both discrete and continuous variables.

Each of the following parameters is either enabled or disabled (a configuration sketch is given after the list):

  • Display Model In Old Format.
  • Kernel Estimator - Kernel density function for numeric attributes rather than a normal distribution.
  • Supervised Discretization - to convert numeric attributes to nominal ones.
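A minimal sketch of this configuration through the WEKA Java API is shown below. The enabled/disabled choices are illustrative, since the text above leaves them open; note also that WEKA does not allow the kernel estimator and supervised discretization to be active at the same time.

    import weka.classifiers.bayes.NaiveBayes;

    public class NaiveBayesConfig {
        static NaiveBayes buildNaiveBayes() {
            NaiveBayes nb = new NaiveBayes();
            nb.setDisplayModelInOldFormat(false);     // "Display Model In Old Format": disabled here
            nb.setUseKernelEstimator(true);           // kernel density estimator for numeric attributes
            nb.setUseSupervisedDiscretization(false); // mutually exclusive with the kernel estimator
            return nb;
        }
    }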

Parameter Setting For SVM Model

Support Vector Machine (SVM) [18] is a comparatively new and promising classification method. In our case, SVM performs best among the three classifiers used for classification. The following parameters are set for the SVM model (a configuration sketch is given after the list).

  • Build logistic models - to fit logistic models to the outputs (for proper probability estimates).
  • The complexity parameter C= 1.0
  • Debug mode.
  • Epsilon.
  • Kernel -RBF kernel.
  • NumFolds – 10
  • Tolerance parameter - 0.001.
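These settings map onto WEKA's SMO implementation roughly as sketched below. This assumes the WEKA SMO class is used; the epsilon is left at WEKA's default because no value is given above, and the debug toggle is illustrative.

    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.RBFKernel;

    public class SvmConfig {
        static SMO buildSvm() {
            SMO svm = new SMO();
            svm.setBuildLogisticModels(true); // fit logistic models for proper probability estimates
            svm.setC(1.0);                    // complexity parameter C
            svm.setEpsilon(1.0e-12);          // epsilon for round-off error (WEKA default; no value given above)
            svm.setToleranceParameter(0.001); // tolerance parameter
            svm.setNumFolds(10);              // internal folds used when fitting the logistic models
            svm.setKernel(new RBFKernel());   // RBF kernel
            svm.setDebug(false);              // debug mode off
            return svm;
        }
    }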

Parameter Setting For Decision Tree-J48 Model

The J48 decision tree classifier operates by constructing a tree and branching on the attribute with the highest information gain. The following parameters are set for constructing the tree (a configuration sketch is given after the list):

  • Binary Splits - binary splits on nominal attributes when building the trees.
  • Confidence Factor - 0.25.
  • Debug mode.
  • MinNumObj - 2 per leaf.
  • NumFolds – 10.
  • Reduced Error Pruning
  • Laplace mode - Whether counts at leaves are smoothed based on Laplace.
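A corresponding sketch for WEKA's J48 follows. Note that in WEKA the confidence factor only applies when reduced-error pruning is off; with both listed as above, the NumFolds setting governs the pruning. The debug toggle is illustrative.

    import weka.classifiers.trees.J48;

    public class J48Config {
        static J48 buildJ48() {
            J48 tree = new J48();
            tree.setBinarySplits(true);        // binary splits on nominal attributes
            tree.setConfidenceFactor(0.25f);   // pruning confidence (ignored when REP is on)
            tree.setMinNumObj(2);              // minimum number of instances per leaf
            tree.setNumFolds(10);              // folds reserved for reduced-error pruning
            tree.setReducedErrorPruning(true); // enable reduced-error pruning
            tree.setUseLaplace(true);          // Laplace smoothing of counts at the leaves
            tree.setDebug(false);              // debug mode off
            return tree;
        }
    }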

RESULTS AND DISCUSSION

We used WEKA [9], a well-known Java-based machine learning tool, to measure the performance of the INPMI feature selection algorithm. We evaluated the selected feature subsets using three learning algorithms, INPMI-Naïve Bayes (NB), INPMI-SVM, and INPMI-J48, with 10-fold cross-validation (CV) on the erythemato-squamous disease dataset from the UCI repository [19].

Accuracy Of Classification

We first measured the classification accuracy on the dataset with full features, using the learning algorithms with 10-fold CV. We then applied our INPMI feature selection algorithm to find the best feature subset, reorganized the dataset using the selected (reduced) features, and evaluated it with INPMI-Naïve Bayes (NB), INPMI-SVM, and INPMI-J48 by the same process. We measured the accuracy of the reorganized dataset and calculated the difference in accuracy between the reorganized and the full-featured dataset. Figure 2 shows the result: almost all of the methods produced better performance with the reduced feature set than with full features. In particular, the INPMI-SVM method performed better than the other methods on the erythemato-squamous disease dataset.

To evaluate the effectiveness of the INPMI method, we conducted experiments on the diagnosis of erythemato-squamous diseases. The significance of each feature is measured by the improved normalized pointwise mutual information, computed on the different training sets; a textbook definition of the underlying NPMI score is sketched below. Ranked from most to least important, the features are 21, 33, 20, 15, 28, 29, 22, 27, 34, 9, 25, 16, 12, 6, 5, 14, 8, 10, 26, 24, 31, 3, 7, 30, 19, 4, 2, 23, 11, 1, 18, 17, 13, 32. We therefore use a sequential forward search procedure that follows this ranking (that is, the improved NPMI values) to construct 34 models with different feature subsets, shown in Table 2; a sketch of this subset construction and evaluation is given after Table 2. Table 3 shows the NB classification accuracies on the testing data for the 34 models, Table 4 the SVM accuracies, and Table 5 the J48 accuracies. Among the 34 models, model #M22 achieved the highest classification accuracy: 98.36% for NB, 98.90% for SVM, and 94.53% for J48 with tenfold cross-validation. Therefore, #M22 is considered the best feature subset. For comparison, Table 6 lists the classification accuracies of our method alongside previous research methods. We can observe that our novel hybrid feature selection method INPMI-SVM obtains considerably better classification accuracy than IFSFS-SVM [15]. We therefore conclude that our method obtains promising results for the diagnosis of erythemato-squamous diseases; the INPMI-SVM model in particular gives good results, and the proposed method can be tested and applied on real-world datasets. The overall performance estimation is shown in Figure 2.
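As a reference point, the standard normalized pointwise mutual information can be computed as follows. This shows only the textbook definition on which INPMI builds, since the "improved" variant itself is not specified in this section, and the probabilities in the example are illustrative.

    public class Npmi {
        // npmi(x;y) = pmi(x;y) / (-ln p(x,y)), bounded in [-1, 1],
        // where pmi(x;y) = ln( p(x,y) / (p(x) p(y)) ).
        static double npmi(double pxy, double px, double py) {
            if (pxy == 0.0) return -1.0; // the events never co-occur: lower bound
            double pmi = Math.log(pxy / (px * py));
            return pmi / (-Math.log(pxy));
        }

        public static void main(String[] args) {
            // Illustrative probabilities: a feature value seen in 40% of samples,
            // a class seen in 30%, co-occurring in 20% of samples.
            System.out.println(npmi(0.20, 0.40, 0.30)); // ~0.32, a positive association
        }
    }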

Table 2. The selected Feature Subset for Erythemato squamous disease based on Hybrid INPMI Feature Selection

Estimator Model (M)   # selected features   Selected feature subset

#M1 1 21

#M2 2 21,33

#M3 3 21,33,20

#M4 4 21,33,20,15

#M5 5 21,33,20,15,28

#M6 6 21,33,20,15,28,29

#M7 7 21,33,20,15,28,29,22

#M8 8 21,33,20,15,28,29,22,27

#M9 9 21,33,20,15,28,29,22,27,34

#M10 10 21,33,20,15,28,29,22,27,34,9

#M11 11 21,33,20,15,28,29,22,27,34,9,25

#M12 12 21,33,20,15,28,29,22,27,34,9,25,16

#M13 13 21,33,20,15,28,29,22,27,34,9,25,16,12

#M14 14 21,33,20,15,28,29,22,27,34,9,25,16,12,6

#M15 15 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5

#M16 16 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14

#M17 17 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8

#M18 18 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10

#M19 19 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26

#M20 20 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24

#M21 21 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31

#M22 22 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3

#M23 23 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7

#M24 24 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30

#M25 25 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19

#M26 26 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4

#M27 27 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2

#M28 28 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2,23

#M29 29 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2,23,11

#M30 30 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2,23,11,1

#M31 31 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2,23,11,1,18

#M32 32 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2,23,11,1,18,17

#M33 33 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2,23,11,1,18,17,13

#M34 34 21,33,20,15,28,29,22,27,34,9,25,16,12,6,5,14,8,10,26,24,31,3,7,30,19,4,2,23,11,1,18,17,13,32
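The construction and evaluation of these nested subsets can be sketched as follows, using WEKA's Remove filter to keep the top-k ranked attributes plus the class label. The classifier, file name, and random seed are illustrative; the ranked indices are those of Table 2.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class SequentialForwardSearch {
        // Feature indices (1-based) ranked by INPMI, most informative first.
        static final int[] RANKED = {21, 33, 20, 15, 28, 29, 22, 27, 34, 9, 25, 16, 12,
                6, 5, 14, 8, 10, 26, 24, 31, 3, 7, 30, 19, 4, 2, 23, 11, 1, 18, 17, 13, 32};

        // Accuracy of model #Mk: the top-k ranked features plus the class attribute.
        static double evaluateSubset(Instances data, int k) throws Exception {
            StringBuilder keep = new StringBuilder();
            for (int i = 0; i < k; i++) keep.append(RANKED[i]).append(',');
            keep.append(data.classIndex() + 1); // Remove uses 1-based indices

            Remove remove = new Remove();
            remove.setAttributeIndices(keep.toString());
            remove.setInvertSelection(true); // keep the listed attributes, drop the rest
            remove.setInputFormat(data);
            Instances subset = Filter.useFilter(data, remove);
            subset.setClassIndex(subset.numAttributes() - 1);

            Evaluation eval = new Evaluation(subset);
            eval.crossValidateModel(new SMO(), subset, 10, new Random(1));
            return eval.pctCorrect();
        }

        public static void main(String[] args) throws Exception {
            Instances data = new Instances(
                    new BufferedReader(new FileReader("dermatology.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            int bestK = 0;
            double bestAcc = 0.0;
            for (int k = 1; k <= RANKED.length; k++) { // models #M1 .. #M34
                double acc = evaluateSubset(data, k);
                if (acc > bestAcc) { bestAcc = acc; bestK = k; }
            }
            System.out.printf("Best subset: #M%d (%.2f%% accuracy)%n", bestK, bestAcc);
        }
    }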

Fig. 2. Performance estimation of individually ranked features by the NB, SVM, and J48 classifiers.

Table 3. Estimated Accuracy of NB Model

Model   Feature added   Specificity   Sensitivity   FPR     Accuracy (%)   ROC
#M1     21              0.848         0.503         0.152   50.2732        0.783
#M2     33              0.848         0.503         0.152   50.2732        0.783
#M3     20              0.893         0.648         0.107   64.7541        0.874
#M4     15              0.948         0.751         0.052   75.1366        0.923
#M5     28              0.955         0.787         0.045   78.6885        0.935
#M6     29              0.955         0.784         0.045   78.4153        0.942
#M7     22              0.953         0.781         0.047   78.1421        0.94
#M8     27              0.956         0.784         0.044   78.4153        0.944
#M9     34              0.955         0.776         0.045   77.5956        0.94
#M10    9               0.958         0.803         0.042   80.3279        0.947
#M11    25              0.96          0.806         0.04    80.6011        0.952
#M12    16              0.96          0.806         0.04    80.6011        0.952
#M13    12              0.963         0.811         0.037   81.1475        0.951
#M14    6               0.963         0.811         0.037   81.1475        0.951
#M15    5               0.963         0.809         0.037   80.8743        0.951
#M16    14              0.982         0.918         0.018   91.8033        0.971
#M17    8               0.982         0.918         0.018   91.8033        0.971
#M18    10              0.982         0.918         0.018   91.8033        0.971
#M19    26              0.982         0.918         0.018   91.8033        0.971
#M20    24              0.985         0.934         0.015   93.4426        0.976
#M21    31              0.985         0.934         0.015   93.4426        0.976
#M22    3               0.986         0.945         0.014   94.5355        0.978
#M23    7               0.986         0.945         0.014   94.5355        0.978
#M24    30              0.986         0.94          0.014   93.9891        0.975
#M25    19              0.986         0.94          0.014   93.9891        0.975
#M26    4               0.986         0.94          0.014   93.9891        0.975
#M27    2               0.986         0.94          0.014   93.9891        0.975
#M28    23              0.986         0.94          0.014   93.9891        0.975
#M29    11              0.986         0.94          0.014   93.9891        0.975
#M30    1               0.985         0.937         0.015   93.7158        0.974
#M31    18              0.986         0.94          0.014   93.9891        0.975
#M32    17              0.987         0.943         0.013   94.2623        0.977
#M33    13              0.987         0.943         0.013   94.2623        0.976
#M34    32              0.987         0.945         0.013   94.5355        0.976

Table 4. Estimated Accuracy of SVM Model

Model   Feature added   Specificity   Sensitivity   FPR     Accuracy (%)   ROC
#M1     21              0.848         0.503         0.152   50.2732        0.784
#M2     33              0.848         0.503         0.152   50.2732        0.784
#M3     20              0.892         0.642         0.108   64.2077        0.877
#M4     15              0.95          0.757         0.05    75.6831        0.931
#M5     28              0.959         0.803         0.041   80.3279        0.95
#M6     29              0.966         0.809         0.034   80.8743        0.963
#M7     22              0.967         0.817         0.033   81.694         0.963
#M8     27              0.967         0.817         0.033   81.694         0.963
#M9     34              0.967         0.822         0.033   82.2404        0.963
#M10    9               0.974         0.858         0.026   85.7923        0.975
#M11    25              0.974         0.858         0.026   85.7923        0.976
#M12    16              0.974         0.858         0.026   85.7923        0.976
#M13    12              0.975         0.858         0.025   85.7923        0.977
#M14    6               0.974         0.858         0.026   85.7923        0.977
#M15    5               0.973         0.85          0.027   84.9727        0.977
#M16    14              0.993         0.962         0.007   96.1749        0.995
#M17    8               0.994         0.967         0.006   96.7213        0.998
#M18    10              0.994         0.97          0.006   96.9945        0.998
#M19    26              0.994         0.97          0.006   96.9945        0.998
#M20    24              0.995         0.975         0.005   97.541         0.999
#M21    31              0.996         0.978         0.004   97.8142        0.999
#M22    3               0.997         0.984         0.003   98.3607        0.999
#M23    7               0.997         0.981         0.003   98.0874        0.999
#M24    30              0.997         0.981         0.003   98.0874        0.999
#M25    19              0.997         0.984         0.003   98.3607        0.999
#M26    4               0.997         0.984         0.003   98.3607        0.999
#M27    2               0.997         0.981         0.003   98.0874        0.999
#M28    23              0.996         0.978         0.004   97.8142        0.999
#M29    11              0.996         0.978         0.004   97.8142        0.999
#M30    1               0.997         0.981         0.003   98.0874        0.999
#M31    18              0.996         0.978         0.004   97.8142        0.999
#M32    17              0.996         0.975         0.004   97.541         0.999
#M33    13              0.996         0.973         0.004   97.2678        0.999
#M34    32              0.995         0.97          0.005   96.9945        0.999

Table 5. Estimated Accuracy of J48 Model

Model   Feature added   Specificity   Sensitivity   FPR     Accuracy (%)   ROC
#M1     21              0.848         0.503         0.152   50.2732        0.791
#M2     33              0.848         0.503         0.152   50.2732        0.791
#M3     20              0.893         0.648         0.107   64.7541        0.871
#M4     15              0.95          0.76          0.05    75.9563        0.929
#M5     28              0.959         0.803         0.041   80.3279        0.945
#M6     29              0.964         0.795         0.036   79.5082        0.936
#M7     22              0.964         0.792         0.036   79.235         0.935
#M8     27              0.964         0.792         0.036   79.235         0.935
#M9     34              0.964         0.795         0.036   79.5082        0.935
#M10    9               0.965         0.814         0.035   81.4208        0.938
#M11    25              0.97          0.836         0.03    83.6066        0.952
#M12    16              0.971         0.839         0.029   83.8798        0.952
#M13    12              0.97          0.833         0.03    83.3333        0.95
#M14    6               0.969         0.828         0.031   82.7869        0.948
#M15    5               0.969         0.831         0.031   83.0601        0.949
#M16    14              0.99          0.948         0.01    94.8087        0.984
#M17    8               0.992         0.956         0.008   95.6284        0.987
#M18    10              0.992         0.956         0.008   95.6284        0.987
#M19    26              0.992         0.959         0.008   95.9016        0.988
#M20    24              0.997         0.986         0.003   98.6339        0.995
#M21    31              0.997         0.984         0.003   98.3607        0.995
#M22    3               0.998         0.989         0.002   98.9071        0.996
#M23    7               0.998         0.986         0.002   98.6339        0.996
#M24    30              0.998         0.986         0.002   98.6339        0.996
#M25    19              0.998         0.989         0.002   98.9071        0.996
#M26    4               0.996         0.981         0.004   98.0874        0.993
#M27    2               0.994         0.973         0.006   97.2678        0.99
#M28    23              0.994         0.97          0.006   96.9945        0.989
#M29    11              0.994         0.97          0.006   96.9945        0.989
#M30    1               0.994         0.97          0.006   96.9945        0.989
#M31    18              0.993         0.967         0.007   96.7213        0.987
#M32    17              0.994         0.97          0.006   96.9945        0.989
#M33    13              0.993         0.962         0.007   96.1749        0.986
#M34    32              0.992         0.956         0.008   95.6284        0.985

Table 6. Comparison of Classification accuracy of our method INPMI with other classifiers from literature

Author                         Method applied                          Classifier accuracy (%)
Ubeyli and Guler (2005)        ANFIS                                   95.50
Luukka and Leppalampi (2006)   Fuzzy similarity-based classification   97.02
Polat and Gunes (2006)         Fuzzy weighted pre-processing           88.18
                               K-NN based weighted pre-processing      97.57
                               Decision tree                           99.00
Nanni (2006)                   LSVM                                    97.22
                               RS                                      97.22
                               B1_5                                    97.50
                               B1_10                                   98.10
                               B1_15                                   97.22
                               B2_5                                    97.50
                               B2_10                                   97.80
                               B2_15                                   98.30
Luukka (2007)                  Similarity measure                      97.80
Ubeyli (2008)                  Multiclass SVM with the ECOC            98.32
Polat and Gunes (2009)         C4.5 and one-against-all                96.71
Ubeyli (2009)                  CNN                                     97.77
Liu et al. (2009)              Naive Bayes                             96.72
                               1-NN                                    92.18
                               C4.5                                    95.08
                               RIPPER                                  92.20
Karabatak and Ince (2009)      AR and NN                               98.61
Juanying Xie et al. (2011)     IFSFS-SVM                               98.61
Our method (INPMI) for         INPMI-NB                                98.36
erythemato-squamous disease    INPMI-SVM                               98.90
                               INPMI-J48                               94.53

CONCLUSION

In this paper, we have proposed an efficient Improved Normalized Pointwise Mutual Information (INPMI) method applicable to medical data mining. An empirical study on the erythemato-squamous disease medical dataset suggests that INPMI gives better overall performance than its existing counterparts in terms of all three evaluation criteria: number of selected features, classification accuracy, and computational time. The comparison with other methods in the literature also suggests that INPMI is competitive. INPMI effectively eliminates irrelevant and redundant features by combining feature subset selection and ranking models, thus providing a small set of reliable features from which physicians can prescribe further medications. The classification performance improves with the removal of redundant features and depends heavily on the inclusion of relevant features, and the accuracy metric is observed to peak when sensitivity and specificity are maximized. The proposed INPMI algorithm performs consistently well with any type of classifier model, which shows the generalization ability and applicability of the proposed system. The best accuracy rate achieved by our proposed system (98.90% for the SVM classifier) is superior to the existing schemes, which shows the effectiveness of the system. Future work includes further improving the performance and scalability of the proposed system using appropriate fusion techniques such as Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC) optimization, and Genetic Algorithms (GA).

Acknowledgements: This work is supported in part by the University Grants Commission Major Research Project under grant no. F.No.:39-899/2010 (SR).
