The Hepatitis Domain Database Biology Essay

The analyses in this research are based on five medical databases. The following subsections describe the source of the data, the reasons for choosing it, and each dataset in detail.

2.1 Source of data

Before the experimental part of the research can start, the data must be collected. Although a great deal of data is available on the Internet, much of it is of little use. Some databases contain many missing attribute values; others lack documentation, so that even the attribute names and the division into conditional and decision attributes are unknown. The UCI data repository provides documented medical datasets, which allows other researchers to conduct similar experiments and compare their results. This was the main reason for selecting databases from the UCI repository [2]. The selected databases differ from each other and belong to five different medical fields, which allows us to evaluate the algorithms' performance on attributes of various kinds.

The UCI Repository of Machine Learning Databases and Domain Theories is a free Internet repository of analytical datasets from several fields [2]. All datasets are stored as text files, and many researchers regard them as a valuable source of data [1]. Five different medical datasets were selected for the analyses. Each dataset is described in this chapter.

2.2 Detailed description of the databases

In this section the five datasets are described in detail. Knowing the nature of a dataset is essential for performing data mining analyses [1]. The number of missing values in the set is an important issue because it may distort the results of the experiment. The conditional and decision attributes should also be studied. All these steps are carried out for each dataset and presented in Sections 2.2.1 to 2.2.5.
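
The two checks named above (counting missing values and separating conditional from decision attributes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in this research: each record is assumed to be a list of strings, a missing value is assumed to be marked with "?", and the decision attribute is assumed to be the last column. The function names are hypothetical.

```python
from typing import List, Tuple

MISSING = "?"  # assumed missing-value marker


def count_missing(records: List[List[str]]) -> List[int]:
    """Return, for each attribute (column), the number of missing values."""
    if not records:
        return []
    counts = [0] * len(records[0])
    for row in records:
        for i, value in enumerate(row):
            if value == MISSING:
                counts[i] += 1
    return counts


def split_attributes(row: List[str]) -> Tuple[List[str], str]:
    """Split a record into conditional attributes and the decision attribute.

    Assumes the decision attribute is stored in the last column.
    """
    return row[:-1], row[-1]


# Made-up toy records, not taken from any of the five datasets.
toy = [
    ["63", "1", "145", "0"],
    ["67", "?", "160", "1"],
    ["41", "0", "?", "0"],
]
missing_per_attribute = count_missing(toy)
conditional, decision = split_attributes(toy[0])
```

For a real file the column position of the decision attribute has to be taken from the dataset documentation, since it is not always the last one.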

2.2.1 Heart disease database

The heart disease database was collected by the V.A. Medical Center, Long Beach and the Cleveland Clinic Foundation in 1988. Each record in the set is assigned to one of five angiographic disease statuses. The severity of the disease is coded with the values 0, 1, 2, 3 and 4: the more advanced the disease, the higher the value. The value 0 indicates absence of the disease.

The database consists of 17 attributes, 13 conditional and 4 decision. Four conditional attributes take discrete natural-number values from defined ranges. Two conditional attributes are binary, and one is positive real-valued.

Table 2.1 Heart-disease database from Cleveland Clinic Foundation[2]

Name of the decision table: Heart-Disease database, Cleveland Clinic Foundation
Number of attributes: 17
Number of symptoms: 13

Symptom names and values:
Age in years: 29,…,77
Sex: 1=male; 0=female
Chest pain type: 1=typical angina; 2=atypical angina; 3=non-anginal pain; 4=asymptomatic
Resting blood pressure in mm Hg: 94,…,200
Serum cholesterol in mg/dl: 126,…,654
Fasting blood sugar >120 mg/dl: 1=true; 0=false
Resting electrocardiographic results: 0=normal; 1=having ST-T wave abnormality; 2=showing probable or definite left ventricular hypertrophy by Estes' criteria
Maximum heart rate achieved: 71,…,202
Exercise induced angina: 1=yes; 0=no
ST depression induced by exercise relative to rest: (0,6.2)
Slope of the peak exercise ST segment: 1=upsloping; 2=flat; 3=downsloping
Number of major vessels colored by fluoroscopy: 1,2,3
Thal: 3=normal; 6=fixed defect; 7=reversible defect

Number of diagnoses: 4

Diagnosis names and values:
Angiographic disease status: 0,1,2,3,4

Number of instances: 303
Missing attribute values: 6

2.2.2 Hepatitis database

The Hepatitis database comes from the Jozef Stefan Institute in Yugoslavia. The data was gathered in 1988 [2]. Hepatitis can be caused by a virus such as the hepatitis B virus (HBV). Early diagnosis of the disease is extremely important because unrecognized disease may lead to chronic hepatitis in 15% of cases. The disease has many symptoms [3]. Detailed information about the dataset is presented in Table 2.2. Most of the attributes are binary, where 1 indicates the presence of a symptom and 0 its absence. Age, Alk phosphate, Sgot and Protime are discrete attributes. Bilirubin and Albumin take continuous values. Sex and Histology are zero-one valued. The decision attribute states whether a patient lived or died. There are many missing attribute values, but they belong to a few records, which were removed in a preprocessing phase because the missing values could not be reconstructed.

Table 2.2 Hepatitis Domain database[2]

Name of the decision table: Hepatitis Domain
Number of attributes: 20
Number of symptoms: 19

Symptom names and values:
Age: 10,20,30,40,50,60,70,80
Sex: 1=male; 0=female
Steroid: 0=no; 1=yes
Antivirals: 0=no; 1=yes
Fatigue: 0=no; 1=yes
Malaise: 0=no; 1=yes
Anorexia: 0=no; 1=yes
Liver big: 0=no; 1=yes
Liver firm: 0=no; 1=yes
Spleen palpable: 0=no; 1=yes
Spiders: 0=no; 1=yes
Ascites: 0=no; 1=yes
Varices: 0=no; 1=yes
Bilirubin: (0.39,4)
Alk phosphate: 33,80,120,160,200,250
Sgot: 13,100,200,300,400,500
Albumin: (2.1,6)
Protime: 10,20,30,40,50,60,70,80,90
Histology: 0=no; 1=yes

Number of diagnoses: 1

Diagnosis name and values:
Class: 1=live; 0=die

Number of instances: 155
Single missing attribute values: 167
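
The preprocessing step mentioned for the hepatitis data, removing the records that contain missing values, might look like the following sketch. It assumes that records are lists of strings and that a missing value is marked with "?"; the function name `drop_incomplete` and the toy rows are hypothetical and are not taken from the dataset.

```python
from typing import List

MISSING = "?"  # assumed missing-value marker


def drop_incomplete(records: List[List[str]]) -> List[List[str]]:
    """Keep only the records in which no attribute value is missing."""
    return [row for row in records if MISSING not in row]


# Made-up toy rows: age, sex, bilirubin, class.
toy = [
    ["30", "1", "1.0", "1"],
    ["50", "0", "?", "1"],
    ["40", "1", "0.7", "0"],
]
complete = drop_incomplete(toy)
```

Dropping whole records is the simplest option and matches what the text describes; it trades a smaller dataset for the guarantee that no value has been guessed.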

2.2.3 Diabetes database

Diabetes has many symptoms. During diagnosis the plasma glucose level is measured, and this examination determines whether the patient has diabetes. Early diagnosis of diabetes is extremely important because unrecognized disease may lead to hypertension, shock, amputation or even death [4]. The Pima Indians Diabetes Database was created at the National Institute of Diabetes and Digestive and Kidney Diseases and made available in 1990 [2]. The database contains information about female patients between 21 and 81 years old. The data was collected using an algorithm called ADAP [2]. A detailed description of the database is given in Table 2.3.

Table 2.3 Pima Indians Diabetes Database[2]

Name of the decision table: Pima Indians Diabetes Database
Number of attributes: 9
Number of symptoms: 8

Symptom names (values; mean; standard deviation):
Number of times pregnant: 0,…,17; mean 3; standard deviation 3
Plasma glucose concentration (oral glucose tolerance test): 0,…,199; mean 121; standard deviation 32
Diastolic blood pressure (mm Hg): 24,…,122; mean 69; standard deviation 19
Triceps skin fold thickness (mm): 7,…,99; mean 21; standard deviation 16
2-hour serum insulin (mu U/ml): 14,…,846; mean 80; standard deviation 115
Body mass index: (18.2,67.1); mean 32; standard deviation 8
Diabetes pedigree function: (0.078,2.42); mean 0.47; standard deviation 0.33
Age in years: 21,…,81; mean 33; standard deviation 12

Number of diagnoses: 1

Diagnosis name and values:
Diabetes: 0=no; 1=yes

Number of instances: 768
Single missing attribute values: 0

The Pima Indians Diabetes Database consists of 9 attributes: one decision attribute and 8 conditional attributes. The decision attribute is binary: the value 1 means that the patient tested positive for diabetes, and the value 0 means that she did not. All conditional attributes are numeric. Six of them are natural numbers and two are positive real numbers from defined ranges. The database contains 768 complete instances, so the analysis is not distorted by missing values.
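
Per-attribute summaries such as the ranges, means and standard deviations listed in Table 2.3 can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the helper `summarize` is hypothetical, the input values are made up, and the sample standard deviation is used as an assumption, since the text does not say which estimator produced the table.

```python
import statistics
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Return the range, mean and sample standard deviation of one attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (assumption)
    }


# Made-up ages, not taken from the Pima Indians Diabetes Database.
toy_ages = [21, 25, 33, 47, 81]
summary = summarize(toy_ages)
```

Replacing `statistics.stdev` with `statistics.pstdev` gives the population standard deviation instead.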

2.2.4 Dermatology database

This database contains 34 attributes, 33 of which are linear valued and one of which is nominal. The differential diagnosis of erythemato-squamous diseases is a difficult problem in dermatology. These diseases all share the clinical features of erythema and scaling, with very few differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Usually a biopsy is necessary for the diagnosis, but unfortunately these diseases share many histopathological features as well. Another difficulty for the differential diagnosis is that a disease may show the features of another disease in its initial stage and develop its own characteristic features only in later stages. Patients were first evaluated clinically with 12 features. Afterwards, skin samples were taken for the evaluation of 22 histopathological features. The values of the histopathological features were determined by analysing the samples under a microscope.

In the dataset constructed for this domain, the family history feature has the value 1 if any of these diseases has been observed in the family, and 0 otherwise. The age feature simply represents the age of the patient. Every other feature (clinical and histopathological) was given a degree in the range of 0 to 3. Here, 0 indicates that the feature was not present, 3 indicates the largest amount possible, and 1 and 2 indicate the relative intermediate values. The names and ID numbers of the patients were removed from the database.
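
The encoding described above lends itself to a simple range check before any analysis. The following sketch is an assumption-laden illustration: the function `is_valid_record` and its argument layout are hypothetical, and the values are expected to be parsed into integers already.

```python
from typing import List

GRADES = (0, 1, 2, 3)  # degree scale used for clinical and histopathological features


def is_valid_record(family_history: int, age: int, graded: List[int]) -> bool:
    """Check one record against the encoding described in the text.

    family_history must be 0 or 1, age must be non-negative,
    and every other feature must be a degree from 0 to 3.
    """
    return (
        family_history in (0, 1)
        and age >= 0
        and all(g in GRADES for g in graded)
    )


# Made-up records, not taken from the dermatology dataset.
ok = is_valid_record(1, 35, [0, 2, 3, 1])
bad = is_valid_record(0, 40, [0, 4, 1])  # 4 is outside the 0 to 3 scale
```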

Table 2.4 Dermatology Database[2]

Name of the decision table: Dermatology Database
Number of attributes: 39
Number of symptoms: 33

Symptom names and values:
Erythema: 0,1,2,3
Scaling: 0,1,2,3
Definite borders: 0,1,2,3
Itching: 0,1,2,3
Koebner phenomenon: 0,1,2,3
Polygonal papules: 0,1,2,3
Follicular papules: 0,1,2,3
Oral mucosal involvement: 0,1,2,3
Scalp involvement: 0,1,2,3
Family history: 0=no; 1=yes
Age: 7,…,75
Melanin incontinence: 0,1,2,3
Eosinophils in infiltrate: 0,1,2,3
PNL infiltrate: 0,1,2,3
Fibrosis of the papillary dermis: 0,1,2,3
Exocytosis: 0,1,2,3
Acanthosis: 0,1,2,3
Hyperkeratosis: 0,1,2,3
Parakeratosis: 0,1,2,3
Clubbing of the rete ridges: 0,1,2,3
Elongation of the rete ridges: 0,1,2,3
Thinning of the suprapapillary epidermis: 0,1,2,3
Spongiform pustule: 0,1,2,3
Munro microabscess: 0,1,2,3
Focal hypergranulosis: 0,1,2,3
Disappearance of the granular layer: 0,1,2,3
Vacuolization and damage of basal layer: 0,1,2,3
Spongiosis: 0,1,2,3
Saw-tooth appearance of retes: 0,1,2,3
Follicular horn plug: 0,1,2,3
Perifollicular parakeratosis: 0,1,2,3
Inflammatory mononuclear infiltrate: 0,1,2,3
Band-like infiltrate: 0,1,2,3

Number of diagnoses: 6

Diagnosis names and values:
Psoriasis: class code=1
Seborrheic dermatitis: class code=2
Lichen planus: class code=3
Pityriasis rosea: class code=4
Chronic dermatitis: class code=5
Pityriasis rubra pilaris: class code=6

Number of instances: 366
Single missing attribute values: 8

2.2.5 Breast cancer database

These data were obtained by means of an image analysis system developed at the University of Wisconsin [2] and contain real observations of 699 oncological instances gathered in 1995. The conditional attributes describe information gained from digitized images of the breast mass. Each examination is characterized by nine attributes whose values lie between 1 and 10. The decision attribute denotes whether the disease is malignant or benign.

Table 2.5 Wisconsin Diagnostic Breast Cancer (WDBC)[2]

Name of the decision table: Wisconsin Diagnostic Breast Cancer (WDBC)
Number of attributes: 11
Number of symptoms: 9

Symptom names and values:
Clump thickness: 1-10
Uniformity of cell size: 1-10
Uniformity of cell shape: 1-10
Marginal adhesion: 1-10
Single epithelial cell size: 1-10
Bland chromatin: 1-10
Normal nucleoli: 1-10
Mitoses: 1-10

Number of diagnoses: 2

Diagnosis names and values:
Malignant: 4
Benign: 2

Number of instances: 699
Single missing attribute values: 16

This database is unusual in that all conditional attributes share the same value range, from 1 to 10. The decision attribute is binary.
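
Because the decision attribute is stored as the codes 2 and 4 (Table 2.5), it is convenient to decode it before inspecting the class balance. The sketch below is a hypothetical illustration: the helper `class_distribution` and the toy codes are not part of the dataset or of the original analysis.

```python
from collections import Counter
from typing import Iterable

# Decision codes as listed in Table 2.5.
CLASS_LABELS = {2: "benign", 4: "malignant"}


def class_distribution(codes: Iterable[int]) -> Counter:
    """Count how many records fall into each diagnosis class."""
    return Counter(CLASS_LABELS[code] for code in codes)


# Made-up decision codes, not taken from the dataset.
toy_codes = [2, 2, 4, 2, 4]
distribution = class_distribution(toy_codes)
```

An unknown code raises a `KeyError`, which makes encoding errors in the input visible instead of silently miscounting them.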

References

[1] Witten I.H., Frank E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier, 2005.

[2] Newman D.J., Hettich S., Blake C.L., Merz C.J., UCI Repository of machine learning databases, 1998 [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

[3] Ryder S., Beckingham I., ABC of diseases of liver, pancreas, and biliary system: Acute hepatitis, 2001, 151-153.

[4] Nathan D.M., Cleary P.A., Backlund J.Y., Genuth S.M., Lachin J.M., Orchard T.J., Raskin P., Zinman B., Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications (DCCT/EDIC) Study Research Group, Intensive diabetes treatment and cardiovascular disease in patients with type 1 diabetes, The New England Journal of Medicine, 2005, vol. 353, 2643-2653.