Knowledge Discovery Process In Data Mining Computer Science Essay

Data mining is the discovery of valuable, non-obvious information from enormous collections of data. It has attracted a great deal of interest in the information industry because large amounts of data are available. Industrial database systems provide functionality such as data collection, database creation, data management and advanced data analysis. Since the 1970s, research and development in database systems has progressed from early hierarchical and network database systems to relational database systems. In addition, users gained convenient and flexible data access through query languages, user interfaces, optimized query processing and transaction management.

Data are stored in various forms of databases and data warehouses. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema. Data warehouse technology includes data cleaning, data integration and online analytical processing (OLAP). OLAP tools support multidimensional analysis and decision making.

Additional data analysis tools are required for in-depth analysis, such as data classification, clustering, and the characterization of data changes over time. In addition, huge volumes of data can be accumulated beyond databases and data warehouses. Typical examples include the World Wide Web and data streams from applications such as video surveillance, telecommunication and sensor networks. The effective and efficient analysis of data in such different forms is a challenging task.

Data mining turns masses of information into meaningful knowledge. It is a procedure for finding underlying patterns in seemingly random data in order to discover new opportunities. The discovered patterns are focused on application problems and support more useful, proactive decision making. Important data mining techniques include decision trees, neural networks, nearest-neighbour clustering, fuzzy logic and genetic algorithms. Data mining is also used across business sectors, where different operations produce different kinds of transactions.


The KDD (knowledge discovery in databases) process consists of several steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge representation. Data cleaning, also called data cleansing, removes noise and irrelevant data. Data integration combines several heterogeneous data sources into a single data source. Data selection retrieves from the data collection the data relevant to the analysis task.

Data transformation, also known as data consolidation, is one of the important steps in KDD: the selected data are transformed into forms suitable for the analysis procedure. Data mining is the step in which intelligent techniques are applied to extract useful information. Pattern evaluation measures the patterns that represent knowledge and identifies the interesting ones. Knowledge representation presents the mined knowledge to the user.
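The preprocessing steps described above can be sketched as a chain of small functions. This is only an illustrative toy, not a prescribed implementation: the records, the field names (`customer`, `spend`, `city`), the min-max scaling and the threshold-based "mining" step are all assumptions made for the example.

```python
# Toy KDD pipeline. All records, field names and thresholds are hypothetical.

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one list of records."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: min-max scale one numeric field into [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, field, threshold):
    """Data mining (trivial stand-in): keep records above a threshold."""
    return [r for r in records if r[field] > threshold]

store_a = [{"customer": "c1", "spend": 120.0, "city": "X"},
           {"customer": "c2", "spend": None, "city": "Y"}]
store_b = [{"customer": "c3", "spend": 20.0, "city": "X"},
           {"customer": "c4", "spend": 70.0, "city": "Z"}]

data = integrate(clean(store_a), clean(store_b))
data = select(data, ["customer", "spend"])
data = transform(data, "spend")
patterns = mine(data, "spend", threshold=0.4)
print(patterns)
```

In a real system each step would be far richer, but the shape is the same: each stage takes the output of the previous one, and only the final stages produce patterns for evaluation.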


The typical data mining architecture consists of several elements: databases, a knowledge base, a warehouse server, analysis modules, a user interface, a data mining engine and a pattern evaluation module. The data are stored in a set of databases, data warehouses, spreadsheets and other types of repositories. The knowledge base holds the domain knowledge used to evaluate patterns; it can include concept hierarchies, which organize attributes into different levels of abstraction.

The warehouse server extracts information based on the queries given by the user; the extracted information is held in the data repositories. The analysis component is a collection of modules that perform tasks such as characterization, association and correlation analysis, cluster analysis and outlier analysis. The user interface communicates between the user and the KDD system, allowing the user to supply information and to specify a mining query. The pattern evaluation module assesses patterns on the basis of their performance and interacts with the other KDD components to filter out discovered patterns using thresholds.


Under the CRISP-DM model, data mining consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment. Business understanding means first understanding the basic business process, then understanding the problem, then planning how to solve it with the resources in hand, and finally defining the goals and objectives to be achieved. Data understanding means examining the historical data from start to finish, working out how the data relate to each other and uncovering hidden information.

Data preparation is the phase in which the data are brought into the desired format; it also handles noisy and missing data. Modelling is the fourth phase in the CRISP-DM model. In this phase a model is developed for future prediction, and many different modelling techniques and parameter settings are tried in order to improve the results. The evaluation phase assesses the model in terms of cost, response time, error rate, confidence level and other criteria. Deployment is the final phase in the CRISP-DM model. The reports generated in this phase can greatly improve the performance of the business.


Data mining has many potential applications, including direct marketing, market segmentation, market basket analysis, fraud detection, interactive marketing, insurance claim analysis and trend analysis. Direct marketing identifies which prospects should be included in a mailing list to obtain the maximum response rate. Market segmentation identifies customers who behave in the same way, for example those who purchase the same products. Market basket analysis finds which products are usually purchased together. Fraud detection identifies fraudulent transactions. Interactive marketing predicts which parts of a website a user is likely to access frequently. Trend analysis reveals the variation between two trends.


Data mining involves six common classes of task: anomaly detection, clustering, association rule learning, classification, regression and summarization. Anomaly detection identifies unusual data records, including data errors, that need further investigation. Association rule learning searches for relationships among variables. Clustering is the task of discovering groups in the data that are in some way similar, without using known structures in the data. Classification predicts class labels, generalizing known structure to new data. Regression attempts to find a function that models the data with the least error. Summarization provides a more compact representation of the data set, including report generation and visualization.
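As a rough illustration of association rule learning, the sketch below scores a candidate rule with the two standard measures, support and confidence. The baskets are invented for the example, and the helper names `support` and `confidence` are assumptions rather than functions from any particular library.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})           # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets
print(rule_support, rule_confidence)
```

A rule is normally reported only when both measures exceed user-chosen thresholds, which is one way the pattern evaluation step filters discovered patterns.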


Data mining uses many techniques, such as classification, estimation, prediction, association and clustering. Classification is the process of assigning items or patterns to classes using a model derived from a set of training data whose class labels are known. The derived model can take various forms, such as a decision tree, which is a flow-chart-like structure, or a neural network.

Classification has many applications, such as predicting consumer behaviour and identifying fraud. For example, a credit card company may have sample data on past applicants, together with knowledge of which applicants were good credit risks and which were not. A classification method may use the sample to derive a set of rules for allocating new applicants to one of the two classes.

A classification process in which the classes have been predefined needs a method that will train the classification system to allocate objects to the classes. The training is based on a training sample, a set of sample data in which the class of each sample is already known. We assume each object has a number of attributes, one of which tells us which class the object belongs to. This attribute is known for the training data, but for data other than the training data (we call this other data the test data) we assume that its value is unknown and is to be determined by the classification method. The attribute may be considered as the output of all the other attributes and is often referred to as the output attribute or the dependent attribute. The attributes other than the output attribute are called the input attributes or the independent attributes.
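The split between a labelled training sample and unlabelled test data can be sketched with a very simple classifier. The example below uses a 1-nearest-neighbour rule purely for illustration (it is not the rule-derivation method of the credit card example), and the applicant attributes and values are made up.

```python
import math

def nearest_neighbour(train, query):
    """Return the output attribute of the training sample closest to `query`."""
    best_label, best_dist = None, math.inf
    for inputs, label in train:
        dist = math.dist(inputs, query)  # Euclidean distance over the input attributes
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Input attributes: (income in thousands, years in current job).
# Output attribute: credit risk. All values are hypothetical.
training_sample = [
    ((60.0, 8.0), "good"),
    ((75.0, 12.0), "good"),
    ((20.0, 1.0), "bad"),
    ((25.0, 0.5), "bad"),
]

# Test data: the output attribute is unknown and must be determined.
test_data = [(70.0, 10.0), (22.0, 2.0)]
predicted = [nearest_neighbour(training_sample, x) for x in test_data]
print(predicted)
```

In practice the input attributes would be scaled first, since a raw distance lets the attribute with the largest numeric range dominate.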

Prediction models continuous-valued functions: it predicts numerical values, whereas classification predicts categorical class labels. Classification and prediction methods can be compared on several criteria. Classifier accuracy is the ability of a classifier to predict class labels correctly, while predictor accuracy refers to how well the predictor guesses the value of the predicted attribute. Robustness is the ability to make accurate predictions given noisy data and data with missing values. Scalability is the ability to handle huge amounts of data efficiently. Interpretability is subjective and therefore more difficult to assess.
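The difference between classifier accuracy and predictor accuracy can be made concrete with two small functions. Using the fraction of correct labels for the classifier and root-mean-square error for the predictor is an assumption made for this sketch; other measures are equally common, and the labels and values below are invented.

```python
import math

def classifier_accuracy(true_labels, predicted_labels):
    """Fraction of class labels that were predicted correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def predictor_rmse(true_values, predicted_values):
    """Root-mean-square error of numeric predictions (lower is better)."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical results from a classifier and a numeric predictor.
accuracy = classifier_accuracy(["good", "bad", "good", "good"],
                               ["good", "bad", "bad", "good"])
rmse = predictor_rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(accuracy, rmse)
```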

Clustering analyzes data objects without consulting known class labels, so it does not need a training set. It is used to form groups of related variables and data, and different types of clustering are used in a mining system depending on the size of the data files. It can also be used to produce class labels. Objects are grouped so as to increase intra-class similarity and decrease inter-class similarity: objects within a cluster are highly similar to one another but very different from objects in other clusters.

Many clustering algorithms are used in mining systems. Requirements for clustering include interpretability, usability and the ability to detect outliers. K-means clustering partitions the entire data set into disjoint subsets, called clusters. Each cluster is formed around its key parameter, the mean value. In distributed environments, K-means clustering may be combined with security tools, and the clustering process is intended to preserve the accuracy of the data. In this centre-based technique, each data element is compared with the centre of every cluster and assigned to the nearest one. Different clusters may hold similar or different kinds of data.
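A minimal sketch of K-means is given below, assuming Euclidean distance, two-dimensional points and hand-picked starting centres so that the run is deterministic. Real implementations choose the starting centres more carefully and usually run several restarts; the points here are invented.

```python
import math

def k_means(points, centres, iterations=10):
    """Alternate an assignment step and a mean-update step until stable."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        new_centres = []
        for centre, cluster in zip(centres, clusters):
            if cluster:
                dims = len(cluster[0])
                new_centres.append(tuple(
                    sum(p[d] for p in cluster) / len(cluster)
                    for d in range(dims)))
            else:
                new_centres.append(centre)  # an empty cluster keeps its centre
        if new_centres == centres:
            break  # centres stopped moving, so the partition is stable
        centres = new_centres
    return centres, clusters

# Hypothetical data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, clusters = k_means(points, centres=[(0.0, 0.0), (10.0, 10.0)])
print(centres)
print(clusters)
```

The resulting clusters are disjoint subsets of the data, and each final centre is the mean of the points assigned to it.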


Data mining raises several kinds of issues: security and social issues, mining methodology issues, user interface issues, performance issues and data source issues. Mining methodology and user interaction issues include mining different kinds of knowledge, interactive mining of knowledge at multiple levels of abstraction, incorporation of background knowledge, data mining query languages, presentation and visualization of mining results, and dealing with noisy and incomplete data.

Performance issues concern the efficiency and scalability of mining algorithms, including distributed, parallel and incremental mining methods. Issues relating to the variety of data types include dealing with relational and complex types of data and mining heterogeneous and global information. Finally, there are issues relating to social impacts and applications.


Health Care:

Health care is one of the main application areas of data mining. It is mainly used for diagnosis, patient profiling and history generation. Mammography is one of the methods used to detect breast cancer, and computer-aided methods that detect tumours are very helpful for medical staff. Tumours in mammograms are classified using neural networks with back-propagation and association rule mining. Data mining is also used effectively to detect lung abnormalities and to reduce patient risk and cost. The analysis of medical records is complex and difficult, but patient records can be mined and the results used to improve patient care.

Web Education:

Web education is an important application of data mining, where it is used to improve courseware. The mining results are mainly used by teachers or course authors to improve the effectiveness of the course. Data mining is applied to the learning material itself.

Business and Finance:

In insurance, data mining is used to support decisions about selling new policies and to build different views of risky customers. Banks use data mining to meet customer needs. It also helps lenders examine the details of loans.

Sports and Gambling:

Sports and gambling form another important application area. Data mining helps to predict winners and to predict which team will be chosen for a tournament. It is also used to analyze player profiles.

The Intrusion Detection in the Network:

Intrusion detection is another important application of KDD. It is used to monitor network traffic, to secure computers and to preserve the security of the network. Watching the traffic by hand is very difficult, so intrusion detection is used to detect abnormal conditions in the traffic.
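One very simple way to flag abnormal traffic conditions is a statistical threshold on traffic volume. The sketch below is an assumption-laden illustration, not how any particular intrusion detection system works: it flags the intervals whose request count lies more than a chosen number of standard deviations from the mean, and the counts and the threshold are invented.

```python
import statistics

def abnormal_intervals(counts, threshold=2.0):
    """Return indices whose count deviates from the mean by more than
    `threshold` population standard deviations."""
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return []  # perfectly uniform traffic has no outliers
    return [i for i, c in enumerate(counts)
            if abs(c - mean) / stdev > threshold]

# Hypothetical requests per minute; the last interval is a sudden burst.
requests_per_minute = [100, 98, 103, 101, 99, 102, 100, 400]
flagged = abnormal_intervals(requests_per_minute)
print(flagged)
```

Real systems combine many traffic features, and often the classification or clustering techniques described earlier, rather than relying on a single volume statistic.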

The Intelligence Agencies:

Intelligence agencies collect different kinds of information and analyze it to investigate terrorist activities. It is difficult to analyze such large volumes of data and to detect the activities of criminals and terrorists. Data mining helps these agencies handle organizations with large databases, and clustering techniques are also used.