Study On Data Mining And Knowledge Engineering Computer Science Essay


Data mining is a vivid term characterizing the process of finding a small set of precious nuggets in a great deal of raw material. It is an iterative sequence of data cleaning, integration, selection and transformation, followed by the application of intelligent methods to extract data patterns. These patterns are then evaluated and the resulting knowledge is presented. Data is prepared for mining through preprocessing, in which redundancy and inconsistency are removed and the data is transformed into a relevant form. Data mining then identifies the truly interesting patterns according to given measures. (Han and Kamber 2006)

The successful application of data mining in highly visible fields such as e-business, marketing and retail has led to its popularity for knowledge discovery in databases (KDD) in other industries and sectors. This literature review intends to survey current techniques for applying data mining tools in different areas. It also discusses critical issues and challenges associated with data mining in general. The survey found a growing number of data mining applications, including the analysis of phishing websites, the diagnosis of tumours and the improvement of coaching strategies in cricket. It enumerates current uses and highlights the importance of data mining in the fields of health, banking and sports.


Modelling Intelligent Phishing Detection System for e-Banking using Fuzzy Data Mining

Background info of the organization and target application

Phishing is a criminal endeavour intended to fraudulently acquire sensitive information by impersonating a legitimate entity in an electronic communication. It tricks people into revealing confidential information, largely through deceptive faux emails leading to a forged website. Increasing sophistication in imitating legitimate websites also adds ambiguity and demands subjective judgement when assessing a website. Fuzzy Data Mining (DM) is therefore an apt tool for classifying and identifying such websites. This paper presents a novel approach to overcoming the 'fuzziness' in e-banking phishing website assessment and proposes an intelligent, resilient and effective model for detecting e-banking phishing websites. (Aburrous, Hossain et al. 2010)

Description of the Data used in mining exercise

The e-banking phishing website detection rate is computed from six criteria, divided into three layers:


URL & Domain Identity

Using the IP Address

Abnormal Request URL

Abnormal URL of Anchor

Abnormal DNS record

Abnormal URL


Security & Encryption

Using SSL certificate


Abnormal Cookie

Distinguished Names Certificate (DN)

Source Code & Java script

Redirect pages

Straddling attack

Pharming Attack

OnMouseOver to hide the Link

Server Form Handler (SFH)


Page Style & Contents

Spelling errors

Copying website

Using forms with Submit button

Using Pop-Ups windows

Disabling Right-Click

Web Address Bar

Long URL address

Replacing similar char for URL

Adding a prefix or suffix

Using the @ Symbol to confuse

Using hexadecimal char codes

Social Human Factor

Emphasis on security

Public generic salutation

Buying time to access accounts

E-banking Phishing Website Rating = (0.3 * URL & Domain Identity crisp) [First layer] + ((0.2 * Security & Encryption crisp) + (0.2 * Source Code & Java script crisp)) [Second layer] + ((0.1 * Page Style & Contents crisp) + (0.1 * Web Address Bar crisp) + (0.1 * Social Human Factor crisp)) [Third layer] (Aburrous, Hossain et al. 2010)
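The weighted combination above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code; the criterion names and the sample crisp scores are invented, while the layer weights follow the formula above.

```python
# Sketch: combining per-criterion defuzzified ("crisp") scores into the
# final e-banking phishing rating using the paper's layer weights.
def phishing_rating(crisp):
    """crisp: dict mapping criterion name -> crisp score in [0, 1]."""
    weights = {
        "url_domain_identity": 0.3,    # first layer
        "security_encryption": 0.2,    # second layer
        "source_code_javascript": 0.2,
        "page_style_contents": 0.1,    # third layer
        "web_address_bar": 0.1,
        "social_human_factor": 0.1,
    }
    return sum(weights[k] * crisp[k] for k in weights)

# Hypothetical crisp scores for one assessed website.
scores = {
    "url_domain_identity": 0.8,
    "security_encryption": 0.5,
    "source_code_javascript": 0.6,
    "page_style_contents": 0.4,
    "web_address_bar": 0.9,
    "social_human_factor": 0.3,
}
rating = phishing_rating(scores)  # 0.24 + 0.10 + 0.12 + 0.04 + 0.09 + 0.03
```

Because the weights sum to 1, the rating stays in [0, 1] whenever the crisp scores do.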

Mining tools:

WEKA and CBA package

Mining algorithms:

Association finding: the Apriori and Predictive Apriori algorithms were used via WEKA.
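The core of Apriori-style association finding can be sketched in pure Python. This is a minimal sketch for illustration, not WEKA's implementation; the transaction items (website characteristics) are invented.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (as frozensets) with support >= min_support.

    transactions: list of frozensets of items.
    """
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    k_sets = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while k_sets:
        counts = {s: sum(1 for t in transactions if s <= t) for s in k_sets}
        survivors = {s: c / n for s, c in counts.items() if c / n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets into (k+1)-itemset candidates.
        prev = list(survivors)
        k_sets = {a | b for a, b in combinations(prev, 2)
                  if len(a | b) == len(a) + 1}
    return frequent

# Hypothetical website records described by their observed characteristics.
tx = [frozenset(t) for t in (["ssl", "long_url"],
                             ["ssl", "popup"],
                             ["ssl", "long_url", "popup"])]
freq = apriori(tx, min_support=2 / 3)
```

With a support threshold of 2/3, the pair {ssl, long_url} is frequent but {long_url, popup}, seen in only one of three records, is pruned.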

According to the strategies used in learning from data, five different data mining algorithms (C4.5, RIPPER, PART, PRISM, CBA) were chosen for assessment.

C4.5 algorithm: It employs a divide-and-conquer approach, generating one decision tree and using pruning techniques to simplify it; each path from the root node to one of the leaves in the tree represents a rule. (Aburrous, Hossain et al. 2010)

RIPPER algorithm: It uses a separate-and-conquer approach, learning rules one at a time and removing the instances each rule covers. (Aburrous, Hossain et al. 2010)

PART algorithm: It combines separate-and-conquer (the RIPPER approach) to generate a set of rules with divide-and-conquer (the C4.5 approach) to build partial decision trees. The difference is that it chooses only one path in each partial decision tree to derive a rule, then discards the tree and all instances covered by that rule. (Aburrous, Hossain et al. 2010)

PRISM algorithm: It is a classification rule learner that can only handle nominal attributes and does not perform any pruning. It implements a top-down (general-to-specific) sequential-covering algorithm that employs a simple accuracy-based metric to pick an appropriate rule antecedent during rule construction. (Aburrous, Hossain et al. 2010)

CBA algorithm: It employs association rule mining to learn the classifier and then adds pruning and prediction steps. This results in a classification approach named associative classification. (Aburrous, Hossain et al. 2010)
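PRISM's accuracy-based selection of a rule antecedent can be illustrated with a short sketch. This is not the WEKA code; it only shows the core idea of picking the attribute=value test whose covered instances are purest for the target class, on invented nominal data.

```python
def best_antecedent(rows, target_attr, target_val):
    """Pick the (attribute, value) test with highest accuracy for target_val.

    rows: list of dicts of nominal attributes, including target_attr.
    Returns ((attr, val), accuracy).
    """
    best, best_acc = None, -1.0
    for attr in rows[0]:
        if attr == target_attr:
            continue
        for val in {r[attr] for r in rows}:
            covered = [r for r in rows if r[attr] == val]
            # Accuracy-based metric: fraction of covered rows in target class.
            acc = sum(r[target_attr] == target_val for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, val), acc
    return best, best_acc

# Hypothetical website records with nominal features.
data = [
    {"ssl": "no",  "long_url": "yes", "phishy": "yes"},
    {"ssl": "no",  "long_url": "no",  "phishy": "yes"},
    {"ssl": "yes", "long_url": "yes", "phishy": "no"},
    {"ssl": "yes", "long_url": "no",  "phishy": "no"},
]
rule, acc = best_antecedent(data, "phishy", "yes")
```

On this toy data the test ssl = no perfectly separates the phishy class, so it is chosen as the rule antecedent; the full algorithm would then repeat on the remaining attributes and instances.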

Outcomes and Benefits:

The fuzzy data mining e-banking phishing website model showed that URL & Domain Identity and Security & Encryption play a significant role in the final phishing detection rate. Certain new correlations and relationships were deduced, such as the conflict between using an SSL certificate and an abnormal URL request, and between phishy characteristics and layers. (Aburrous, Hossain et al. 2010)

Determining the key features in the e-banking phishing website archive data using classification algorithms is a difficult problem and requires some intuition regarding the goal of the data mining exercise. (Aburrous, Hossain et al. 2010)


A Novel Approach for Mining Association Rules on Sports Data using Principal Component Analysis: For Cricket match perspective

Background info of the organization and target application

The sports world is known for the assortment and vastness of the data it collects. Sports organizations, operating in an extremely competitive environment, need to seek any edge that will give them an advantage over others. The culture has long encouraged analysis and the discovery of new knowledge, as exhibited by video annotation. However, it is not possible to derive meaning from the data manually or to unearth the information and knowledge hidden within it. Hence, this case study takes a step towards an automated framework for identifying specifics and correlations among play patterns. Other existing approaches intended for the same purpose have not proved effective and are generally limited to basic statistics. Therefore, a new data reduction method (PCA) and a frequent pattern generation method were adopted, which later proved competent. (UmaMaheswari and Rajaram 2009)

Description of the Data used in mining exercise

Since real-time cricket data is too complex, an object-relational model is used to provide a more sophisticated structure for storing such data.

{BallId, Action1, Action2, Action3, Result}

Action {{<Entity1>, <Entity2>, ...}, Relation, DList}

Entity {EName, Role, AttributeId}

Description {D1, D2, ..., Dn}

To extract a task-relevant subset of attributes, this structure was generalized using PCA. After PCA generated frequent patterns, the generalized attributes were:

Compressed pattern (numeric code is assigned to every possible pattern)

Decompressed form of pattern (Abbreviated form of pattern)

E.g. BG_PS_JT means Bouncing with Pull Shot with Just Try


No Run



Wide Ball


No Ball


Mining Algorithms:

1: Principal Component Analysis (PCA), also known as the Karhunen-Loeve transform.

Purpose: Dimensionality Reduction

Redundant or highly correlated data is removed by compressing the data through covariance analysis, in order to find frequent patterns.

Input: Generalized match dataset

Output: reduced dataset
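The PCA step (center the data, compute the covariance matrix, project onto the leading eigenvectors) can be sketched with NumPy. This is an illustrative sketch, not the paper's implementation, and the small numeric dataset is invented.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance analysis
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                         # reduced dataset

# Hypothetical generalized match records with 3 numeric attributes,
# two of which are highly correlated (redundant).
X = np.array([[2.0, 0.1, 4.0],
              [4.0, 0.2, 8.1],
              [6.0, 0.1, 12.0],
              [8.0, 0.3, 15.9]])
reduced = pca_reduce(X, 1)  # 3 attributes compressed to 1 component
```

Because the first and third attributes move together, a single principal component captures almost all of the variance in this toy dataset.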

2: Frequent pattern generation

Purpose: Frequent pattern analysis and Summarization.

The interestingness of each frequent pattern is ascertained through a cutoff threshold.

Input: compressed dataset

Output: frequent pattern table

3: Algorithm: Cricket-mine

Purpose: Association analysis on PCA generated frequent pattern set

Patterns that have a strong correlation, association or causal structure are deduced for the assessment.

Input: PCA generated frequent pattern set, compressed set, minconf-threshold

Output: Strong association rules.
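The final step, filtering candidate rules by a minimum-confidence threshold, can be sketched as follows. This is a sketch of the general association-rule step, not the Cricket-mine code itself; the pattern codes and support values are invented.

```python
def strong_rules(support, minconf):
    """Derive strong rules A -> b from frequent itemsets.

    support: dict mapping frozenset(pattern items) -> support value.
    Returns a list of (antecedent, consequent, confidence) triples.
    """
    rules = []
    for itemset, supp in support.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            # confidence(A -> b) = support(A u {b}) / support(A)
            conf = supp / support[antecedent]
            if conf >= minconf:
                rules.append((antecedent, frozenset([item]), conf))
    return rules

# Hypothetical supports: BG = bouncing delivery, PS = pull shot.
support = {
    frozenset(["BG"]): 0.40,
    frozenset(["PS"]): 0.35,
    frozenset(["BG", "PS"]): 0.30,
}
rules = strong_rules(support, minconf=0.7)
```

Here both BG -> PS (confidence 0.75) and PS -> BG (confidence ~0.86) pass the threshold, so both are reported as strong association rules.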


Frequent pattern identification and rule representation are used to present the inferred results in textual form. Consequently, the knowledge generated by this process is more valuable and constructive, in the sense that it is easily understandable by all users. (UmaMaheswari and Rajaram 2009)

Through PCA, the database size is reduced to 18% of its original memory space. PCA-based frequent pattern generation proved more efficient than the widely used Apriori in terms of the time taken for frequent item extraction, extracting frequent patterns without making a single scan over the entire database. (UmaMaheswari and Rajaram 2009)


Cricket match data is highly available and rapidly growing in size, far exceeding human abilities to analyze it. Therefore, this automated framework helps identify specifics and correlations among play patterns, so as to extract knowledge meticulously. This knowledge can further be represented as useful information for modifying or improving coaching strategies and methodologies, delivering performance enrichment at the team level as well. (UmaMaheswari and Rajaram 2009)

This work can be adapted to other games such as football, basketball etc. (UmaMaheswari and Rajaram 2009)


Application of Data Mining Techniques for Medical Image Classification

Background info of the organization and target application

Breast cancer is a disease in which malignant (cancer) cells form in the tissues of the breast. Breast cancer is the second leading cause of cancer deaths in women today (after lung cancer) and is the most common cancer among women, except for skin cancers. Millions of women are expected to be diagnosed annually with breast cancer worldwide. Therefore, the need of the hour is to explore a better and more efficient mining technique for an automated framework based on mammography, which can diagnose tumours from imaging data. This case study demonstrates the use and effectiveness of two different data mining techniques, neural networks and association rule mining, for image categorization. (Antonie, Zaiane et al. 2001)

Description of the Data used in mining exercise

To enhance the feature extraction phase, two techniques are applied: a cropping operation (to remove the black parts and existing artefacts from the image) and image enhancement (quality improvement). After this, features relevant to classification are extracted from the cleaned images. (Antonie, Zaiane et al. 2001)

The existing features are:

Location of the abnormality (like the centre of a circle surrounding the tumour)

Breast position (left or right)

Type of breast tissue (fatty, fatty-glandular or dense)

Tumour type (benign or malignant)

The extracted features are four statistical parameters:





Data Mining Tools:

Association Mining Algorithm

In the training phase

Apriori algorithm: applied to the training data to discover association rules between the features extracted from the mammography database and the category to which each mammogram belongs. (Antonie, Zaiane et al. 2001)

In the classification phase

The low and high confidence thresholds are set such that the maximum recognition rate is reached. (Antonie, Zaiane et al. 2001)

Neural Network

Back-propagation algorithm: an extension of the least mean square algorithm that can be used to train multi-layer networks. It is an approximate steepest-descent algorithm that minimizes the squared error, using the chain rule to compute the derivatives of the squared error with respect to the weights and biases in the hidden layers. (Antonie, Zaiane et al. 2001)
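The chain-rule updates described above can be sketched with a tiny one-hidden-layer network in NumPy. This is an illustrative sketch, not the paper's network or data: the inputs are synthetic stand-ins for four extracted features, not real mammogram statistics.

```python
import numpy as np

# Synthetic training set: 32 samples with 4 features; the label is a simple
# function of the first two features (stand-in for benign/malignant).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer of 6 sigmoid units, one sigmoid output unit.
W1 = rng.normal(scale=0.5, size=(4, 6)); b1 = np.zeros(6)
W2 = rng.normal(scale=0.5, size=(6, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(2000):
    h = sigmoid(X @ W1 + b1)            # forward pass: hidden activations
    out = sigmoid(h @ W2 + b2)          # forward pass: network output
    err = out - y                       # gradient of 0.5 * squared error
    d_out = err * out * (1 - out)       # chain rule through output sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)  # chain rule into hidden layer
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);  b1 -= lr * d_h.mean(axis=0)

acc = float(((out > 0.5) == (y > 0.5)).mean())
```

After training, thresholding the output at 0.5 classifies the synthetic samples with high accuracy; the slow full-batch loop also hints at why the paper reports long training times for back-propagation.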

Outcome and benefits

Computer-aided diagnosis has a higher rate of detection, because even experienced radiologists sometimes cannot detect a tumour. Mammography assists the medical staff in achieving high efficiency and effectiveness. (Antonie, Zaiane et al. 2001)

It allows the study of crucial values and parameter dependencies in a limited imaging data volume, yet the accuracy remained good. The project is partly implemented, and the pre-processing of mammograms and the extraction of features should be dictated by rules that make medical sense. (Antonie, Zaiane et al. 2001)

The back-propagation method proved less sensitive to database imbalance, at the cost of long training times. Association rule mining, with a much more rapid training phase, obtained better results on a well-balanced dataset than those reported in the literature. Both methods performed well, obtaining a classification accuracy of over 70%. This shows that association rule mining employed in the classification process is worth further investigation, and a larger mammographic database could be used to extract more features from the images in the near future. (Antonie, Zaiane et al. 2001)