The Process Of Data Mining Computer Science Essay



Abstract: Data mining is the process of extracting patterns from data. We mine data in order to transform it into meaningful information that people can understand and benefit from. It has become increasingly useful as more industries adopt it to turn their data into information. Common application areas include marketing, fraud detection and scientific discovery.


The word "data" is common in paperwork and proposals, but do we really understand what it means? Can we differentiate between data and information? Here is a simple explanation. Data is a set of values that represent the qualitative or quantitative attributes of a variable or set of variables. An attribute is a property or characteristic of an object, such as the eye color of a person or a temperature. Information, in contrast, is data that has been transformed into a meaningful form that can be shared by many people. The difference is that data is raw material that carries no meaning for the end user on its own, while information is processed data that does.


What is data mining?

Besides the term data mining, the industry also uses the term "Knowledge Discovery in Databases" (KDD). As mentioned before, data mining is the process of extracting patterns from data; it uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans.

Models involved in data mining

There are two kinds of model in the data mining process: the predictive model and the descriptive model. The choice between them determines how we process and evaluate the data before it becomes meaningful information. It is important to know the difference between the two so that we are on the right path when extracting patterns from the data later.

The first is the predictive model. A predictive model uses some variables to predict unknown or future values of other variables. It falls under the category of supervised learning: one variable is clearly labeled as the target variable Y and is explained as a function of the other variables, X. The nature of the target variable determines the type of model. Classification, regression and deviation detection fall under this model.
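As a minimal sketch of the supervised setting, the following Python fits a straight line Y = a + bX by ordinary least squares and then predicts Y for an unseen X. The data values and the function names `fit_line` and `predict` are made up for illustration; regression is only one of the predictive tasks listed, and nothing here is prescribed by the essay.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(model, x):
    """Evaluate the fitted line at a new value of x."""
    a, b = model
    return a + b * x


# Hypothetical training data: X is the known variable, Y the target.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
model = fit_line(xs, ys)
print(predict(model, 5.0))
```

The target Y is known for the training rows, which is what makes the task supervised; the fitted model is then applied to rows where Y is unknown.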

The descriptive model describes how things actually work and answers the question, "What is this?" It tries to find patterns in the data, and its aim is to describe the data, not to predict. The tasks that fall under this model are clustering, association rule discovery and sequential pattern discovery.


Traditionally, clustering techniques are broadly divided into hierarchical and partitioning methods. Hierarchical clustering is further subdivided into agglomerative and divisive. The basics of hierarchical clustering include the Lance-Williams formula, the idea of conceptual clustering, the now-classic algorithms SLINK and COBWEB, and the newer algorithms CURE and CHAMELEON.
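To make the agglomerative idea concrete, here is a naive sketch in Python: every point starts as its own cluster, and the two closest clusters are merged repeatedly until the requested number remains. It uses single linkage (the distance between two clusters is the distance between their closest members) on one-dimensional points. The function name `single_linkage` and the sample values are assumptions for illustration. This is not the SLINK algorithm itself, which computes the same single-linkage hierarchy more efficiently.

```python
def single_linkage(points, num_clusters):
    """Naive agglomerative clustering of 1-D points with single linkage."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i (bottom-up step).
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]


print(single_linkage([1.0, 1.5, 2.0, 8.0, 8.5], 2))
```

A divisive method would work in the opposite direction, starting from one cluster that holds all points and splitting it step by step.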

Concept of clustering

To explain the concept of clustering, we take a library as our example. A library holds books on a large variety of topics, and they are always kept in clusters: books that are similar in some way are placed together. For example, books on databases are kept in one cupboard and books on operating systems in another. To reduce the complexity further, books that cover the same kind of topic are placed on the same shelf. The shelves and cupboards are then labeled with the relevant names. When a user wants a specific kind of book on a specific topic, he or she only has to go to that particular shelf and look for the book there, rather than searching the entire library.
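The benefit in the library example can be sketched in a few lines of Python. The catalogue entries and the function names `build_shelves` and `find_book` are hypothetical. Note that the topic labels are given in advance here, so this is plain grouping and not a clustering algorithm; it only illustrates why a search restricted to one cluster is cheaper than a search over the whole collection.

```python
from collections import defaultdict

# Hypothetical catalogue: (title, topic) pairs.
books = [
    ("Intro to Databases", "databases"),
    ("SQL Basics", "databases"),
    ("Operating Systems Primer", "operating systems"),
    ("Process Scheduling Notes", "operating systems"),
]


def build_shelves(catalogue):
    """Group titles by topic so that each topic has its own labeled shelf."""
    shelves = defaultdict(list)
    for title, topic in catalogue:
        shelves[topic].append(title)
    return shelves


def find_book(shelves, topic, title):
    """Search only the shelf for the given topic, not the whole library."""
    return title in shelves.get(topic, [])


shelves = build_shelves(books)
print(find_book(shelves, "databases", "SQL Basics"))
```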

The rest of the paper is divided into five sections.

Section 2 defines some frequently used terms.

Section 3 describes some traditional approaches of clustering.

Section 4 gives the implementation of clustering in the field of Data Mining and The Medical field.

Section 5 includes a case study of Windows NT Operating System.

Section 6 concludes the paper.

This example shows how clustering is used to manage the books in a library so that readers can find them more easily.

Elements of cluster

Types of attributes the algorithm can handle

Scalability to large datasets

Ability to work with high dimensional data

Ability to find clusters of irregular shape

Handling outliers

Time complexity (when there is no confusion, we use the term complexity)

Data order dependency

Labeling or assignment (hard or strict vs. soft or fuzzy)

Reliance on a priori knowledge and user-defined parameters

Interpretability of results
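Of the properties listed above, the difference between hard and soft assignment is the easiest to show in code. In the sketch below, a hard assignment gives a point to exactly one cluster, while a soft assignment gives it a membership weight in every cluster. The weights follow the membership formula used in fuzzy c-means with fuzzifier `m`; the function names, the one-dimensional cluster centers and the sample point are assumptions for illustration.

```python
def hard_assign(point, centers):
    """Hard (strict) assignment: index of the single nearest center."""
    dists = [abs(point - c) for c in centers]
    return dists.index(min(dists))


def soft_assign(point, centers, m=2.0):
    """Soft (fuzzy) assignment: one membership weight per center, summing to 1."""
    dists = [abs(point - c) for c in centers]
    if 0.0 in dists:
        # The point sits exactly on a center: full membership there.
        hit = dists.index(0.0)
        return [1.0 if i == hit else 0.0 for i in range(len(dists))]
    power = 2.0 / (m - 1.0)
    # Fuzzy c-means membership: closer centers receive larger weights.
    return [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]


centers = [0.0, 10.0]
print(hard_assign(2.0, centers))
print(soft_assign(2.0, centers))
```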

Applications of clustering

Clustering is becoming more popular day by day. Many applications in industry use it to organize their data so that it can be managed and searched properly.

Similarity searching in Medical Image Database

"This is a major application of the clustering technique. In order to detect many diseases like Tumor etc, the scanned pictures or the x-rays are compared with the existing ones and the dissimilarities are recognized.

We have clusters of images of different parts of the body. For example, the images of the CT Scan of brain are kept in one cluster. To further arrange things, the images in which the right side of the brain is damaged are kept in one cluster. The hierarchical clustering is used. The stored images have already been analyzed and a record is associated with each image. In this form a large database of images is maintained using the hierarchical clustering.

Now when a new query image comes, it is firstly recognized that what particular cluster this image belongs, and then by similarity matching with a healthy image of that specific cluster the main damaged portion or the diseased portion is recognized. Then the image is sent to that specific cluster and matched with all the images in that particular cluster. Now the image with which the query image has the most similarities, is retrieved and the record associated to that image is also associated to the query image. This means that now the disease of the query image has been detected.

Using this technique and some really precise methods for the pattern matching, diseases like really fine tumor can also be detected.

So by using clustering an enormous amount of time in finding the exact match from the database is reduced."

This article shows that clustering has also been applied in the medical field. With clustering, doctors and nurses can more easily find stored records that match a patient's symptoms. This reduces the time spent searching for the right disease and improves the quality of the service they provide.
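The two-step retrieval described in the quoted article can be sketched as follows. Each stored image is reduced to a small feature vector with an attached record; a query is first matched to the nearest cluster and then compared only with the images inside that cluster. The feature vectors, cluster names and records below are invented for illustration, and choosing the cluster by distance to its centroid is an assumption, since the article does not say how the cluster is recognized.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


# Hypothetical database: cluster name -> list of (feature_vector, record).
database = {
    "brain": [([0.9, 0.1], "record A"), ([0.7, 0.4], "record B")],
    "chest": [([0.1, 0.9], "record C"), ([0.3, 0.6], "record D")],
}


def retrieve(query, db):
    # Step 1: choose the cluster whose centroid is closest to the query.
    name = min(db, key=lambda k: distance(query, centroid([v for v, _ in db[k]])))
    # Step 2: compare the query only against the images in that cluster.
    _, record = min(db[name], key=lambda item: distance(query, item[0]))
    return name, record


print(retrieve([0.72, 0.38], database))
```

Only the images of one cluster are compared in the second step, which is where the time saving claimed in the article comes from.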

Data Mining

"A company that sales a variety of products may need to know about the sale of all of their products in order to check that what product is giving extensive sale and which is lacking. This is done by data mining techniques. But if the system clusters the products that are giving fewer sales then only the cluster of such products would have to be checked rather than comparing the sales value of all the products. This is actually to facilitate the mining process."

Clustering is used in data mining in many industries. From the article above, we can see that a business can easily detect which products are affecting its sales, rather than wasting time comparing the sales values of all the products. Once again, this suggests that clustering reduces the time users spend searching for information.
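As a rough sketch of the sales example, the code below splits products into a low-sales and a high-sales cluster using k-means with two clusters on a single variable. The product names, sales figures and the function name `two_means` are hypothetical, and k-means is only one possible choice; the quoted article does not name a specific algorithm.

```python
def two_means(values, iterations=20):
    """Split 1-D values into a low and a high cluster with k-means (k = 2)."""
    low, high = min(values), max(values)
    if low == high:
        # All values are identical, so there is nothing to split.
        return list(values), []
    for _ in range(iterations):
        # Assign each value to the nearer of the two cluster means.
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each mean to the average of its assigned values.
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return low_group, high_group


# Hypothetical units sold per product.
sales = {"pens": 120, "notebooks": 150, "staplers": 20, "folders": 35, "markers": 140}
low_group, _ = two_means(list(sales.values()))
low_sellers = sorted(name for name, units in sales.items() if units in low_group)
print(low_sellers)
```

Once the low-sales cluster is known, only its members need to be inspected, which is the shortcut the article describes.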

Windows NT

"Another major application of clustering is in the new version of windows NT. Windows NT uses clustering, it determine the nodes that are using same kind of resources and accumulate them into one cluster. Now this new cluster can be controlled as one node."

This article shows that the use of clustering in a computing system makes it possible to share a computing load over several systems without either the users or the system administrators needing to know that more than one system is involved.

Dstar software

"D* is a data storage and retrieval system for advanced scientific studies. The system is designed to enable a tighter integration of data storage and data mining technologies. Through the application of innovative retrieval and clustering techniques for high-dimensional data, the system can support high-performance data access and provide data mining applications useful insights into the data that can facilitate subsequent processes of data preparation and data mining. The basic processes of D* include data clustering, space partitioning, data loading, and data retrieval based on region queries and similarity searching. D* scales well with increasing data dimensionality and works well on incremental load of data."

This article suggests that clustering has become one of the main components of software that helps businesses. By applying clustering, the developer focuses on reducing the time needed to search for the required data.


"Software package for multi-dimensional space reduction and data clustering."

This software focuses on space reduction, applying an efficient clustering algorithm for high-dimensional data. It also performs adjacency-connected agglomeration of dense cells.


In conclusion, clustering gives many benefits to the industries that apply it. These include less time spent searching, higher-quality insight and data, a large pool of warehoused data, and the ability to perform maintenance and upgrades with limited downtime.