Prediction System Using Data Mining Techniques


Lymphoma is a cancer of the white blood cells that spreads through the lymphatic system, and it is one of the most common deadly diseases in the world. A mining structure with applied computational intelligence performs an automatic analysis of patient data to discover knowledge for medical decision making. To develop the Lymphoma Cancer Survivability Prediction System (LCSPS) model, a popular data mining technique, the Decision Tree method, is used. This paper describes a technique for classification and attribute extraction for lymphoma cancer from a large dataset. The dataset consists of the clinical information of patients who were suspected of having lymphoma, a blood cancer that is recorded alongside leukemia in the source data. Classification accuracy results were also obtained from a Neural Network model. In addition, this paper provides predictions of the survival rate after diagnosis with a high degree of accuracy. LCSPS is designed to be implemented in a client-server architecture.

Keywords: Lymphoma Disease, Decision Tree, Data Mining, Architectural Design, Classification

In this paper, 5-fold cross-validation was used to measure the results and performance.

The results indicate that the Decision Tree is the best predictor, with xx% accuracy.

1. Introduction

Lymphoma is a cancer that originates in the lymphatic system, and it is one of the diseases for which a Clinical Decision Support System (CDSS) is used. Such a system supports the clinical process by choosing an appropriate model to measure the effectiveness and efficiency of the survival rate for the diagnosed patient. Lymphoma is divided into two categories: Hodgkin lymphoma and all other lymphomas (non-Hodgkin lymphoma).

2. Problem Statement


Eagles Wing Hospital holds a large amount of patient information, but this dataset does not help the medical team predict a patient's survival rate after the disease is diagnosed or provide treatment based on its stage. Doctors usually make clinical decisions based on their experience rather than with an expert tool. Such practice can lead to wrong decisions that put the patient's survival at high risk. It also affects the quality of service provided to the patient and the prestige of the hospital. Predicting the outcome of a disease is very complex, and it is challenging to develop data modeling and analysis methods, such as data mining techniques, that can assist in decision making. The development of medical knowledge and the elicitation and maintenance of the knowledge base are another major problem. Most of the medical team have difficulty making a proper clinical decision within a short period, even though the patients' data is available in the database.

This problem is due to the insufficient time available for diagnosis and treatment. The doctors and nurses can make recommendations on diagnosis and treatment but cannot predict the survival status. Moreover, the doctors and nurses are unable to keep up with the fast development of clinical decision making and improve the quality of patient care. In addition, the inconsistent and varied application of medical knowledge leads to poor decision making.

For this reason, knowledge cannot be easily maintained or incorporated, because the service cannot be integrated with a new CDSS application. In addition, the cost of medical care is increasing because a tremendous amount of data and rules is incorporated, and the patients' data or cases stored in the database become obsolete.

3. Research Objective

The objective is to develop LCSPS, a CDSS that assists the medical team in making more intelligent clinical decisions using three data mining modeling techniques: Decision Tree, Neural Network and Clustering. The prototype is developed to find the best model by applying these algorithms to the SEER Lymphoma cancer database. This is intended to make the decision-making process for the Lymphoma survivability rate more accurate and precise.

This system is developed to provide an intelligent and effective Lymphoma survival prediction system. LCSPS provides integrative and personalized clinical support to assist the medical team in all tasks of the clinical workflow, for example data collection, diagnosis, symptoms, treatment and vital status records, which are gathered from patient data.

The target audience for LCSPS is the medical team, such as doctors, interns, nurses and physicians.

4. Literature Review


Many systems have been developed over the years to support clinical decision making for various types of cancer, including lymphoma. Systems such as Medic Exchange, CADUCEUS and Internist-I have been used in health care institutions around the world to assist medical teams with the prediction of cancer risk. This literature review covers studies of the methodology, algorithms, Clinical Decision Support (CDS), architecture and functionality of existing systems.

Data mining concepts are used extensively in many sectors, namely clinical, banking, finance and research. However, based on this study, clinical data mining has not become popular due to the complexity of supporting data modeling on large datasets.

This paper discusses a total of 8089 patients with lymphoma cancer between 1999 and 2005. All of these patients had been diagnosed with Chronic Lymphocytic Leukemia Lymphoma and others [4,5].

According to the SEER statistics fact sheets, an estimated XXXXX men and women had been diagnosed with Lymphoma Cancer. The survival rate was calculated from different attributes that measure the survival of cancer patients.

5. Requirement Specification

Several meetings were held with physicians, nurses and other health care professionals for Lymphoma Cancer at Eagle Wing's Klang Valley Branch to identify the requirement specifications. Use Case analysis and the Unified Modelling Language (UML) were used to gather functional and other requirements. A conceptual model was built through a three-step process:

1) Development of Use Cases,

2) Development of a data model using UML, and

3) Validation of the data model.

The platform is aimed at helping practicing administrative staff and nurses meet their clinical and practice-related needs. It will provide access to information, products, services and resources to help facilitate medical practice and ease the adoption of health information technology. Three database products can be considered in developing the Clinical Decision System for Lymphoma Cancer for Eagle Wing's Klang Valley: Microsoft Access, Microsoft SQL Server and SQL Server Analysis Services.

Microsoft Access is a data management application that stores information in tables that it manages directly on the local disk. It can also be used to interface with information that is located elsewhere and handled by another storage management system. In this case, Access acts as a client that connects to a server that provides the data.

Microsoft SQL Server is an application used to create computer databases for the Microsoft Windows family of server operating systems. Microsoft SQL Server provides an environment used to generate databases that can be accessed from workstations, the Internet, or other media such as a personal digital assistant (PDA).

SQL Server Analysis Services (SSAS) enables BI (Business Intelligence) workers to work on multi-dimensional data in SQL Server, which can be gathered from different kinds of sources, from flat files to relational databases. SSAS can analyze data that is grouped and aggregated into different formats and views, like the faces of a cube. An SSAS application adds the value of data analysis and presents the data as OLAP cubes, OLAP reports or data mining features.
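As a rough illustration of what viewing data "like the faces of a cube" means, the sketch below counts records for each combination of chosen dimensions. It is plain Python rather than SSAS, and the rows, field names and the function cube_counts are hypothetical.

```python
from collections import Counter

# Hypothetical patient rows; field names and values are illustrative only.
ROWS = [
    {"sex": "F", "grade": "I", "marital_status": "married"},
    {"sex": "F", "grade": "II", "marital_status": "single"},
    {"sex": "M", "grade": "I", "marital_status": "married"},
    {"sex": "M", "grade": "I", "marital_status": "single"},
]


def cube_counts(rows, dimensions):
    """Count rows for every combination of values of the given dimensions."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)


# One "face" of the cube: counts by sex and grade.
by_sex_and_grade = cube_counts(ROWS, ["sex", "grade"])
# Rolling up to a single dimension: counts by grade only.
by_grade = cube_counts(ROWS, ["grade"])
```

Choosing a different list of dimensions gives a different view of the same rows, which is the idea behind slicing a cube.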

5.2. Functional and Non-Functional Requirements


Functional requirements are statements of the services the system should provide, how the system should react to particular inputs and how the system should behave in particular situations [6]. More information about the requirements is captured in the use case and table below.

Non-functional requirements are constraints on the services or functions offered by the system, such as timing constraints, constraints on the development process and standards [6].

5.4. Bench Marking (Comparison)

6. Methodology and Model

The LCSPS was built from scratch. The steps below were taken throughout system development:

1) Choose the Lymphoma data from the SEER database.

2) Identify the data fields based on the file format in the SEER data dictionary.

3) Convert the LYMYLEUK text file into Excel, then export it into the database selected in SQL Server Business Intelligence Development Studio.

4) Eliminate fields that are irrelevant for prediction, then upload the remaining data into the lymphoma database.

5) Choose the key, input and prediction columns.

6) Choose the data mining models and generate the Decision Tree, Clustering and Neural Network models.

7) Test and validate the finished system for accuracy.

8) Collect the results obtained as evidence for the study.

9) Generate a report to document the system outcomes.
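To make the model-generation step above more concrete, the following sketch shows information gain, a split criterion commonly used by decision tree learners. It is an illustration only: the paper does not state which criterion the SQL Server algorithm applied, and the rows, attribute names and functions here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting rows on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_split(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy data: grade separates the classes, sex does not.
TOY_ROWS = [
    {"grade": "low", "sex": "F", "survived": "yes"},
    {"grade": "low", "sex": "M", "survived": "yes"},
    {"grade": "high", "sex": "F", "survived": "no"},
    {"grade": "high", "sex": "M", "survived": "no"},
]
root_attribute = best_split(TOY_ROWS, ["grade", "sex"], "survived")
```

A tree learner would apply best_split recursively to each resulting group until the groups are pure or too small to split further.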

6.1 Data Mining Models

In Microsoft Visual Studio, the Data Mining Extensions (DMX) query language was used for the mining structure, mining models, mining model viewer, mining accuracy chart and mining model prediction.

The Lymphoma mining structure parameters were set by selecting the attributes (related data) from the database. Each attribute is given one of five settings: Ignore, Input, Predictable, Predict Only or Key.

The mining models are built and evaluated against the lymphoma dataset in the mining model viewer for accuracy and effectiveness before LCSPS is deployed.
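The attribute settings can be pictured as a mapping from column name to role. The sketch below is a plain Python illustration, not DMX. The column names follow the field list given later in the paper, but the role given to each column is an assumption, as are the Role enum and the split_columns helper. It treats a Predictable column as both an input and a target, and a Predict Only column as a target only.

```python
from enum import Enum


class Role(Enum):
    KEY = "Key"
    INPUT = "Input"
    PREDICT = "Predictable"
    PREDICT_ONLY = "Predict Only"
    IGNORE = "Ignore"


# Hypothetical role assignment; the paper does not list the actual settings.
ROLES = {
    "patient_id": Role.KEY,
    "marital_status": Role.INPUT,
    "sex": Role.INPUT,
    "age_at_diagnosis": Role.INPUT,
    "grade": Role.INPUT,
    "registry_id": Role.IGNORE,
    "survival_year": Role.PREDICT_ONLY,
}


def split_columns(record, roles=ROLES):
    """Return (key, inputs, targets) for one record according to its roles."""
    key, inputs, targets = None, {}, {}
    for name, value in record.items():
        role = roles.get(name, Role.IGNORE)  # unknown columns are ignored
        if role is Role.KEY:
            key = value
        if role in (Role.INPUT, Role.PREDICT):
            inputs[name] = value
        if role in (Role.PREDICT, Role.PREDICT_ONLY):
            targets[name] = value
    return key, inputs, targets


example = {"patient_id": 42, "sex": "F", "age_at_diagnosis": 67,
           "registry_id": 1501, "survival_year": 4}
key, inputs, targets = split_columns(example)
```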

Validating model effectiveness.
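One common way to validate model effectiveness is k-fold cross-validation, which the abstract mentions. The sketch below reads that mention as 5-fold cross-validation, which is an assumption. The function names and the majority-class baseline model are hypothetical stand-ins for the real mining models.

```python
from collections import Counter


def k_fold_indices(n_records, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Records are not shuffled here; shuffle them first if order matters.
    """
    sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_records) if i < start or i >= start + size]
        yield train, test
        start += size


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(train_records, train_labels):
    """Hypothetical stand-in model: always predict the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda record: most_common


def cross_validate(records, labels, train_fn, k=5):
    """Average test accuracy of train_fn over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = train_fn([records[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        predicted = [model(records[i]) for i in test_idx]
        scores.append(accuracy(predicted, [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)
```

Each record is used for testing exactly once, so the averaged score is less dependent on one particular train/test split.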

6.2. Data Source

In this paper, the SEER cancer incidence and population data were used to develop models that predict the survivability of diagnosed Lymphoma Cancer cases. SEER data is the most comprehensive source of information on cancer diagnosis, treatment, symptoms and survival in the United States [5].

The SEER data consists of eight to nine text files, each containing the information for a group of cancers, for example Breast Cancer, Lung Cancer, Lymphoma Cancer, Urinary and others. The SEER data consists of 27 columns, each holding a specific piece of information about a cancer incidence. The SEER database includes the patient's registry, id, socio-demographic details such as race, gender, marital status and age, the month and year of diagnosis, treatment, vital status and others [5].

The SEER Lymphoma Cancer data consists of 328,191 records and 72 variables, of which 56 variables are not related to Lymphoma Cancer cases. Each record represents a case with a patient id and a registry id. The patient id is the unique primary identifier that differentiates the cases. The text files were converted into Excel files and uploaded into the SQL Server database.

Since the goal of this data mining project is to develop models for predicting survival after an incidence of Lymphoma Cancer, a key variable is created to represent and calculate survival.

Furthermore, the dataset contains missing, inconsistent and incorrect data. While converting the dataset, all such records were deleted to ensure that the analysis of the prediction model is accurate.
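Reading fields out of a text record by position and keeping only the relevant ones might look like the sketch below. The column offsets in LAYOUT are placeholders, not the real SEER positions, which must be taken from the SEER data dictionary. The function names are hypothetical.

```python
# Hypothetical layout: (field name, start, end), zero-based, end exclusive.
# These offsets are placeholders, not the real SEER column positions.
LAYOUT = [
    ("patient_id", 0, 8),
    ("registry_id", 8, 18),
    ("sex", 18, 19),
    ("age_at_diagnosis", 19, 22),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width text line into a dict of named string fields."""
    return {name: line[start:end].strip() for name, start, end in layout}


def keep_fields(record, wanted):
    """Drop every field that is not in the wanted list."""
    return {name: record[name] for name in wanted}


sample_line = "00000042" + "0000001501" + "2" + "067"
parsed = parse_record(sample_line)
relevant = keep_fields(parsed, ["patient_id", "sex", "age_at_diagnosis"])
```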
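The paper does not say how the survival variable is calculated. One possible derivation, shown purely as an assumption, labels a case from the number of years survived and the vital status. The five-year threshold, the argument names and the vital status values are all hypothetical.

```python
def survival_label(survival_years, vital_status, threshold_years=5):
    """Return 1 (survived), 0 (did not survive) or None (cannot be decided).

    Assumed rule, not taken from the paper:
    - survived at least threshold_years      -> 1
    - died before threshold_years            -> 0
    - alive but followed for a shorter time  -> None (outcome not yet known)
    """
    if survival_years is None or vital_status is None:
        return None
    if survival_years >= threshold_years:
        return 1
    if vital_status == "dead":
        return 0
    return None
```

Cases that return None would be excluded from training under this rule, because their outcome at the threshold is not known.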
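A minimal sketch of such a cleansing pass is given below. The specific rules (which fields are required, the plausible age range, and treating a repeated patient id as inconsistent) are assumptions made for illustration; the paper does not list the actual rules.

```python
# Hypothetical list of fields that must be present for a record to be usable.
REQUIRED = ("patient_id", "sex", "age_at_diagnosis", "grade")


def is_clean(record):
    """Return True if no required field is missing and the age is plausible."""
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            return False  # missing data
    try:
        age = int(record["age_at_diagnosis"])
    except (TypeError, ValueError):
        return False  # incorrect data: age is not a number
    return 0 <= age <= 120


def clean(records):
    """Keep clean records, dropping any repeated patient id."""
    seen, kept = set(), []
    for record in records:
        if not is_clean(record) or record["patient_id"] in seen:
            continue
        seen.add(record["patient_id"])
        kept.append(record)
    return kept


raw = [
    {"patient_id": "1", "sex": "F", "age_at_diagnosis": "67", "grade": "II"},
    {"patient_id": "2", "sex": "", "age_at_diagnosis": "54", "grade": "I"},
    {"patient_id": "3", "sex": "M", "age_at_diagnosis": "abc", "grade": "I"},
    {"patient_id": "1", "sex": "F", "age_at_diagnosis": "67", "grade": "II"},
    {"patient_id": "4", "sex": "M", "age_at_diagnosis": "300", "grade": "I"},
]
cleaned = clean(raw)
```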

-What happens to the missing data

-Average of surrounding data

Source of the data


Quality of the data

7. System Implementation

The raw data is imported from various sources and transferred to the data warehouse for further analysis. Updates to the data come into the data warehouse from different external sources. Figure 7.1 below illustrates the data architecture for the Lymphoma Cancer data warehouse.


Figure 7.1: Lymphoma Cancer Data Warehouse

Data extracted from external sources is routinely loaded into the data warehouse, a process also called data provisioning. The data is extracted from different departments within the hospital and loaded into the Lymphoma staging database. Data cleansing also takes place in this process to separate out quality data, as shown in Figure 7.2. A total of 337,041 noisy records were removed from the Lymphoma database. The remaining 8208 cases were identified for data mining.

Figure 7.2: Data Cleansing Process

The data is loaded into a relational database, the SQL Server Database Engine, and is viewed from the SQL Server Business Intelligence Development Studio tool using SQL Server Analysis Services (SSAS). The data warehouse consists of a collection of patient data for the Lymphoma database. The raw data consists of the Patient ID Number, Marital Status, Sex, Age at Diagnosis, Primary Site, Histology, Histology Type, Grade, Diagnosis Confirmation and Survival Year details, as shown in Figure 7.3.


Figure 7.3: Lymphoma field description table

The objective of the data query is to store the data in the database engine, from which the data is queried and passed to the data mining technique for processing and reporting. The data can be queried and displayed in the prediction view, as illustrated in Figure 7.4.


Figure 7.4: Lymphoma query database results


Figure 7.5: Lymphoma Neural Network Algorithm Mining Value


Figure 7.6: Lymphoma Neural Network Mining Model Viewer


Figure 7.7: Lymphoma Neural Network Mining Model Prediction


Figure 7.8: Neural Network Algorithm Query

8. Results and Interpretation


Figure 8.1: Neural Network Algorithm Survival Year Prediction Results


Figure 8.2: Lymphoma Decision Tree Mining Model Viewer


Figure 8.3: Lymphoma Clustering Mining Model Viewer

9. Evaluation and Limitation

Evaluation assesses the quality, value, effects and impacts of information technology and applications in the health care environment, in order to improve health information applications and to enable the emergence of an evidence-based health informatics profession and practice (Ammenwerth et al., 2004).

Evaluation studies of the Clinical Decision Support System for Lymphoma Cancer for Eagle Wing's Klang Valley Branch have aimed to measure the impact of the system on a well-delineated and limited part of the process. Systems that have been evaluated relatively frequently were designed, for example, to provide support for diagnosis, disease management, drug management and preventive interventions.

Other evaluation studies have covered the impact of a system on the quality of decision making, its impact on clinical actions, usability, integration with workflow and the quality of the clinical advice offered. The cost-effectiveness of Clinical Decision Support Systems and their ability to help improve clinical outcomes have been evaluated relatively infrequently.

Figure 7:

Evaluation studies covered include:

• clinical impact

• impact on working practices

• usability

• knowledge content

• system requirements

• technical issues

• interoperability

• managing an evaluation.

Physicians, doctors, nurses and health care professionals at Eagle Wing's Klang Valley Branch were evaluated as they completed the scenarios.

The user interface of the Clinical Decision Support System uses open standards to make it easier for physicians to use the program. Physicians are more likely to use a tool if it does not require additional training. For that reason, we did not provide instruction in the use of the Clinical Decision Support System prior to the scenario testing. In this way, we could determine whether the system was intuitive and easy to use.

After establishing normality, the mean total scores and individual components of the scores for the lymphoma clinical scenarios were analyzed.

The first limitation of the Clinical Decision Support System for Lymphoma Cancer for Eagle Wing's Klang Valley Branch is that it covers only a narrow field of medical knowledge and exhibits a significant decline in performance outside that field. Secondly, its scope is limited by the underlying capabilities of the programming languages and data models used.

A Clinical Decision Support System cannot represent the rich variety of diagnostic and therapeutic reasoning strategies that clinicians use to solve complex patient problems. These limitations are especially critical because computer systems fail to recognize internally when their results become erroneous.

There are also limitations in the user interface: a Clinical Decision Support System relies on computable input data, which represents just a small proportion of the information required to make clinical decisions. It is extremely difficult for the user to determine whether the input data adequately represents the patient's clinical problem.

A Clinical Decision Support System sometimes fails to represent common-sense knowledge and, as a result, has no real understanding of the patient's problem. The correctness of the system's advice cannot be guaranteed.

10. Conclusion and Future Work

The value of knowledge can be recognized only if it is used effectively and efficiently. This paper develops a prediction model in combination with Clinical Decision Support to improve the quality of Lymphoma diagnosis and treatment and to reduce both the time and cost required. Overall, this prediction model is not only acceptable to the medical team but is also of great help to it.