# Field Of Data Mining Computer Science Essay



Data mining extracts useful information and identifies hidden patterns and shared attributes within large datasets. It provides powerful support for decision-making through supervised and unsupervised data analysis techniques. In data mining, working with data means grouping it into a set of categories or clusters, either with or without supervision. To learn about new artifacts or understand new domains, researchers look for the patterns and properties that define them and compare them with known notions, based on the similarity or dissimilarity of their attributes according to well-defined rules.

Data mining projects use a range of techniques, including association, classification, prediction, sequential patterns, decision trees, and clustering. Each technique is briefly explained below.

## Association

Association, also known as the relation technique, discovers a pattern based on a relationship between items in the same transaction. The most common example is market basket analysis, which identifies consumers' purchasing trends across associated products.
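Market basket analysis is usually summarized with two ratios, support and confidence. The sketch below is a minimal illustration of those two measures only; the transactions and the function names `support` and `confidence` are invented for this example and do not come from any particular library.

```python
# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated share of baskets with `antecedent` that also hold `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_conf = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, round(bread_to_milk_conf, 3))
```

A rule such as "bread implies milk" is typically kept only if both values exceed thresholds chosen by the analyst.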

## Classification

Classification is a typical data mining technique drawn from machine learning. It assigns each item in a dataset to one of a predefined set of classes. Mathematical techniques such as neural networks, statistics, decision trees, and linear programming are used to perform classification.

## Prediction

Prediction is a data mining technique that discovers relationships and dependencies among variables, both among the independent variables themselves and between dependent and independent variables. A regression curve fitted to historical data can then be used for future prediction.
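As a small illustration of fitting a regression curve to historical data, the sketch below fits a straight line by ordinary least squares and extrapolates one step ahead. The data values and the helper name `fit_line` are assumptions made for this example; a straight line is only the simplest possible choice of curve.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical observations (period, value).
history_x = [1, 2, 3, 4, 5]
history_y = [2.0, 4.1, 5.9, 8.0, 10.0]

slope, intercept = fit_line(history_x, history_y)
forecast = slope * 6 + intercept  # prediction for the next period
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```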

## Sequential Patterns

Sequential pattern analysis seeks to identify similar patterns, recurring events, or trends in transaction data over a business period.

## Decision Trees

The decision tree is one of the most widely used data mining techniques because it is easy to understand and use. The root of the tree is a condition that has several possible answers. Each answer leads to a further set of conditions that help process the data until a final decision can be made.
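The root-condition-then-further-conditions structure can be sketched as a nested dictionary that is walked until a leaf is reached. Everything in this example (the features, thresholds, outcomes, and the `decide` helper) is hypothetical and hand-written; a real decision tree would be learned from data rather than typed in.

```python
# Internal nodes are dicts with a feature, a threshold, and two branches.
# Leaves are plain strings holding the final decision.
tree = {
    "feature": "income",
    "threshold": 50000,
    "below": {
        "feature": "age",
        "threshold": 30,
        "below": "decline",
        "at_or_above": "review",
    },
    "at_or_above": "approve",
}

def decide(node, record):
    """Follow one branch per condition until a leaf (a string) is reached."""
    while isinstance(node, dict):
        if record[node["feature"]] < node["threshold"]:
            node = node["below"]
        else:
            node = node["at_or_above"]
    return node

print(decide(tree, {"income": 40000, "age": 45}))
```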

## Clustering

Clustering is a data mining technique that automatically groups objects with similar characteristics into clusters. Unlike classification, in which objects are assigned to predefined classes, clustering is unsupervised: it defines the classes itself and places objects in each class based on similar properties.

Data mining systems are either supervised or unsupervised, depending on whether the domain is already known. If the domain is known, classes are defined in advance and supervised classification is performed. If the domain is unknown, unsupervised clustering is performed, using exploratory data analysis to identify hidden patterns in the data.

Clustering is an undirected data mining technique for unsupervised machine learning. It places individual artifacts into relevant groups, without prior knowledge of the groups' distinguishing properties, in order to explore structure in the data. Clusters expose hidden patterns in the data, which can then be used for machine learning. The aim of clustering is to partition an unlabeled dataset into separate groups by learning a hidden data concept. For example, the spending behaviors of different population segments can be compared to find out which segments to target for a new product release.

Clustering is an initial and fundamental step in data analysis. Historically, its foundations were laid in mathematics, statistics, and numerical analysis. It is an unsupervised classification of patterns into groups of similar objects, so that patterns within a cluster are more alike to each other than to patterns in other clusters. It identifies groups of related records that can serve as a starting point for exploring further associations. Clustering methods can be classified into the following five major types:

- Partition-Based
- Hierarchical
- Density-Based
- Grid-Based
- Model-Based

Because clustering is meant to support unsupervised machine learning, algorithms are used to process the data. One of the biggest challenges in cluster analysis is deciding which algorithm to use for a specific problem. Algorithms differ in how they operate, and so they produce different cluster analysis models. Understanding these models is essential to understanding the differences between the outputs of different algorithms. Typical cluster models include:

- Connectivity models
- Centroid models
- Density models
- Subspace models
- Graph-based models
- Distribution models
- Group models

Of the various data mining techniques, clustering is one of the most challenging used in the knowledge discovery process. The goal is to identify clusters without any prior knowledge of the attributes that distinguish them. Clustering techniques group identified artifacts according to the following criteria:

- Each cluster should be internally homogeneous.
- Each cluster should be distinct from the other clusters.

Clustering is undoubtedly useful in many areas, such as geoinformatics, web mining, bioinformatics, market research, market segmentation, image processing, document categorization, machine learning, and pattern recognition.

This makes clustering a technique that combines methods from different disciplines, such as mathematics, statistics, biology, artificial intelligence, databases, and informatics, and uses them to process data.

The unsupervised nature of the problem implies that its structural characteristics are unknown: the spatial distribution of the data, in terms of the number, volume, density, shape, and orientation of the clusters, is not known in advance. Like many other areas of scientific research, clustering encounters three additional complications when applied to data mining applications:

- Huge data repositories
- Objects with differing characteristics
- Numerous attribute types

These complications pose challenges because they demand more resources for processing the data. Without domain knowledge, several clustering solutions can appear equally reasonable with respect to the underlying data distribution.

By its nature, clustering poses problems for which any solution may violate at least one of the rules regarding scale invariance, richness, and cluster consistency. These properties are defined to enhance the credibility of clustering techniques. For example, if the variables do not have equal variance, it is impossible to avoid clusters that are dominated by the variables with the most variation; similar concerns apply to cluster consistency and richness. A lack of consistency between data partitions is likewise a serious threat to the credibility of the clusters formed.
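One common way to keep high-variance variables from dominating distance-based clusters is to rescale every variable before clustering. The sketch below assumes z-score standardization, which the text does not prescribe, and uses invented column values; `standardize` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(value - mu) / sigma for value in column]

# Hypothetical attributes on very different scales.
incomes = [20000, 40000, 60000]
ages = [20, 30, 40]

z_incomes = standardize(incomes)
z_ages = standardize(ages)

# After rescaling, both attributes contribute comparably to a distance measure.
print([round(z, 3) for z in z_incomes])
print([round(z, 3) for z in z_ages])
```

Without this step, a Euclidean distance over the raw columns would be driven almost entirely by income.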

Clustering techniques rely on a data model built on particular assumptions, and misguided assumptions may lead us to apply the wrong model to the sample data, producing erroneous or irrelevant results. Domain knowledge is therefore important for successful clustering, although even domain experts may not be able to provide such crucial information. To establish firm grounds for the distribution of the sample data, or to determine the proper number of clusters, we need to identify relevant subspaces or visualize the domain knowledge. Because clustering tasks are exploratory in nature, efficient and effective methods are required to strengthen individual clustering algorithms.

## Literature Review

The ability to monitor the progress of students' academic performance is a critical issue for the academic community in higher education. One study describes a system that analyzes students' results using cluster analysis and standard statistical algorithms, arranging their score data according to level of performance. The authors implemented the k-means clustering algorithm [6] and combined it with the deterministic model in [7] to analyze the results of students at a private institution in Nigeria. They present a simple, qualitative methodology for comparing the predictive power of the clustering algorithm, with Euclidean distance as the measure of similarity. The technique was demonstrated on a dataset of results for 79 students, each taking nine courses in the semester, and it produces a numerical interpretation of the results for performance evaluation. According to the authors, the model improves on some limitations of existing methods, such as the models developed in [7] and [8], which applied a fuzzy model to predict students' academic performance on only two datasets (English Language and Mathematics) of secondary school results. The work in [9] provides only a data mining framework for students' academic performance. The research in [10] used rough set theory as a classification approach to analyze student data; the Rosetta toolkit was used to describe the dependencies between the attributes and student status, and the discovered patterns were explained in plain English.
The authors conclude that this clustering algorithm serves as a good benchmark for monitoring the progression of students' performance in higher institutions. It also supports decision-making by academic planners, who can monitor candidates' performance semester by semester in order to improve academic results in subsequent sessions.
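To make the combination of k-means and Euclidean distance concrete, the sketch below clusters a handful of score records. It is a minimal textbook version of k-means, not the reviewed study's implementation; the scores, the fixed starting centroids, and the helper names are all assumptions made for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: euclidean(point, centroids[j]))

def k_means(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids

# Hypothetical scores of six students in two courses.
scores = [(45, 50), (48, 52), (50, 47), (80, 85), (82, 88), (85, 83)]
labels, centroids = k_means(scores, initial_centroids=[scores[0], scores[3]])
print(labels)
```

The starting centroids are fixed here so the run is repeatable; with random starts, k-means can settle on different partitions.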

Several researchers have proposed techniques for analyzing the performance of clustering algorithms in data mining. Not all of these techniques give good results for the chosen datasets and algorithms, and some clustering algorithms suit only certain kinds of input data. One study uses arbitrarily distributed input data points to evaluate the clustering quality and performance of two partition-based clustering algorithms, k-Means and k-Medoids. Clustering quality is evaluated using the distance between data points, and performance is measured by the computational time of each algorithm. The experimental results show that k-Means yields better results than k-Medoids. The results of both algorithms are analyzed in terms of the number of data points and the computational time, and the behavior of the algorithms is analyzed by observation. The data points are clustered according to their arbitrarily shaped distribution. Time complexity analysis, a part of computational complexity theory, describes an algorithm's use of computational resources; in this case, the best-case and worst-case running times are reported. In Table 1, the maximum and minimum times taken by k-Means are 172 and 156 respectively; likewise, in Table 2, the maximum and minimum times taken by k-Medoids are 221 and 196. The algorithms were also run several times with other inputs (300 data points, 400 data points, etc.) and with 10 and 15 clusters; these results are not shown, but the outcomes were reported to be highly satisfactory. Figure 4 graphs the average results for the distribution of data points.
The average execution time is taken from Tables 1 and 2. Figure 4 shows a clear difference between the times of the two algorithms: the average execution time of k-Means is much lower than that of k-Medoids. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. Running time usually varies from one processor to another, depending on the speed and type of the system. Partition-based algorithms work well for finding spherical clusters in small to medium-sized datasets. The efficiency of the algorithms for arbitrary distributions of data points was analyzed over various executions of the programs. The study concludes that the computational time of k-Means is less than that of k-Medoids for the chosen application, and hence that k-Means is the more efficient of the two.
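One plausible contributor to a timing gap of this kind is the update step: a mean needs a single pass over a cluster, while a straightforward medoid search compares every member with every other member. The sketch below contrasts the two representatives on one cluster; it is not the study's code, and the cluster values and function names are invented for illustration.

```python
import math

def mean_point(cluster):
    """Centroid used by k-means: one pass over the members."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

def medoid(cluster):
    """Representative used by k-medoids: the member with the smallest total
    distance to all members. This naive search needs n * n distance
    evaluations for a cluster of n points."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))

# Hypothetical cluster containing one far-away point.
cluster = [(1, 1), (2, 2), (3, 3), (4, 4), (100, 100)]

print(mean_point(cluster))  # pulled toward the far-away point
print(medoid(cluster))      # always an actual member of the cluster
```

The same example hints at the usual trade-off: the medoid costs more to compute but is less affected by extreme values.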

Developing an effective clustering method for high-dimensional datasets is a challenging problem because of the curse of dimensionality. Among partition-based clustering algorithms, k-means is one of the best-known methods for partitioning a dataset into groups of patterns. However, k-means converges to one of many local minima, and the final result is known to depend on the initial starting points (means). Many methods have been proposed to improve the performance of k-means. The authors of one study compare their proposed method with existing work. The method uses Principal Component Analysis (PCA) for dimension reduction and to find the initial centroids for k-means, and then applies a heuristic approach to reduce the number of distance calculations needed to assign data points to clusters. The algorithm was evaluated on the Iris dataset from the UCI machine learning repository [9]. Table 4 compares the clustering results achieved by k-means, by PCA + k-means with random initialization, and by k-means with initial centers derived by the proposed algorithm. Table 2 shows the results obtained by paper [10], and Table 3 those obtained by paper [7]. For standard k-means, the initial centroids are selected randomly; the experiment was conducted 7 times with different randomly selected initial centroids, and the accuracy and time of each run were recorded and averaged. In the proposed work, the number of principal components is decided by their degree of contribution to the total variance. Table 1 shows the results of a principal component analysis of the Iris data: three principal components explain about 99.48% of the total variance.
There is therefore hardly any loss of information in the dimension reduction. The results presented in Figure 2 show that the proposed method provides better cluster accuracy than the existing methods, performing much better than random initialization and than the other authors' initialization method. This may be because the initial cluster centers generated by the proposed algorithm are quite close to the optimum solution, and because the algorithm discovers clusters in the low-dimensional space, overcoming the curse of dimensionality. Figure 3 compares the CPU time (in seconds) of the proposed method with that of the existing methods. The execution time of the proposed algorithm was much less than the average execution time of k-means with random initialization, and the proposed method provides higher accuracy than the other authors' method while taking almost equal time. The main objective of applying PCA to the original data before clustering is to obtain accurate results, but the clustering results still depend on the initialization of the centroids; the proposed method therefore finds the initial centroids and clusters the data in the low-dimensional space. Overall, on the Iris dataset the proposed method was found to be more accurate and efficient than the existing methods. As future work, the authors plan to apply the method to microarray cancer datasets.
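The "contribution degree" of a principal component is the share of total variance carried by its eigenvalue. The sketch below computes that share for two-dimensional data only, where the eigenvalues of the covariance matrix have a closed form; it uses invented points rather than the Iris data, and the function names are hypothetical.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (var_x, cov_xy, var_y) of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return var_x, cov_xy, var_y

def contribution_degrees_2d(points):
    """Share of total variance explained by each of the two principal components."""
    a, b, c = covariance_2d(points)
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]] in closed form.
    mid = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (mid + radius, mid - radius)
    total = a + c  # the trace equals the sum of the eigenvalues
    return tuple(ev / total for ev in eigenvalues)

# Hypothetical, strongly correlated measurements.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]
ratios = contribution_degrees_2d(points)
print([round(r, 4) for r in ratios])
```

When the first ratio is close to 1, dropping the second component loses very little information, which is the argument the study makes for keeping three components of the Iris data.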

The DBSCAN algorithm [1] is popular in the data mining field because it can elegantly find arbitrarily shaped clusters while setting noise aside. Because the original DBSCAN algorithm computes distance measures between objects, it consumes a great deal of processing time, and its computational complexity is O(N²). The authors of one study propose a new algorithm to improve the performance of DBSCAN. The new solution combines two existing algorithms, A Fast DBSCAN Algorithm [6] and Memory effect in DBSCAN algorithm [7], to speed up processing as well as improve the quality of the output. Because the region query operation takes a long time to process objects, only a few objects are considered for expansion, and the border objects that would otherwise be missed are handled separately during cluster expansion. The basic DBSCAN, Fast DBSCAN, and proposed Optimized DBSCAN (ODBSCAN) algorithms were implemented in Visual C++ (2008) on Windows Vista and tested on a two-dimensional synthetic dataset. To expose the real performance difference achieved by the new algorithm, no additional data structures (such as spatial trees) were used. The reported results show that the new algorithm outperforms the existing algorithms in computation time and loses fewer objects than the Fast DBSCAN algorithm. ODBSCAN thus improves performance with less object loss by combining the approaches of FDBSCAN and MEDBSCAN, and it introduces some new techniques to minimize distance computations during the RegionQuery function call.
The performance analysis and the output show that ODBSCAN gives better output with good performance. In this algorithm, all border objects are considered in the clustering process. However, there is still some possibility of missing core objects, which causes some loss of objects. Although the new algorithm gives better results than the earlier FDBSCAN algorithm, this problem needs to be resolved in further work to give accurate results at the same performance.
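The quadratic cost comes from the region query: without an index, finding the neighbors of one object means measuring its distance to every other object, and this is repeated for each object. The sketch below shows that naive query and the core-object test from basic DBSCAN; it is not ODBSCAN, and the sample points, `eps`, and `min_pts` values are assumptions chosen for illustration.

```python
import math

def region_query(points, index, eps):
    """Indices of all points within `eps` of points[index], itself included.
    One call scans all N points, so calling it for every point costs N * N."""
    center = points[index]
    return [j for j, q in enumerate(points) if math.dist(center, q) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points whose eps-neighborhood holds at least `min_pts` points."""
    return [
        i for i in range(len(points))
        if len(region_query(points, i, eps)) >= min_pts
    ]

# Hypothetical 2-D dataset: a tight square plus one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

print(core_points(sample, eps=1.5, min_pts=3))
print(region_query(sample, 4, eps=1.5))  # the isolated point only finds itself
```

Approaches that speed up DBSCAN generally aim to make fewer of these calls or to make each call cheaper.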

Another study analyzes the characteristics and weak points of traditional density-based clustering algorithms in depth and then puts forward an improved approach based on a density distribution function. K Nearest Neighbor (KNN) is used to measure the density of each point, and a local maximum density point is defined as a center point. Using a local scale, the classification is extended outward from the center point. For each point, a procedure based on a radius scale factor determines whether it is a core point. The classification is then extended again from the core points until the density falls to a given ratio of the density of the center point. The tests show that the improved algorithm greatly reduces the sensitivity of density-based clustering algorithms to their parameters and enhances the clustering of high-dimensional datasets with uneven density distribution. The improved algorithm is built on the ideas of local scale and boundary threshold. Its main characteristics are as follows: it has a solid mathematical foundation and generalizes many other clustering approaches, such as partitioning, hierarchical, density-based, grid-based, and model-based methods; it clusters datasets containing a lot of "noise" very well; it gives a concise mathematical description of the clustering of arbitrarily shaped, high-dimensional datasets; and it takes the point with maximum density as the center point, from which the classification is extended to the density boundary threshold. The purpose of the paper is to reduce the sensitivity of DENCLUE to its parameters and to enhance the clustering of high-dimensional datasets with uneven density distribution. The tests show that these problems are greatly alleviated.
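A simple way to picture KNN-based density is to treat a point as dense when its k-th nearest neighbor is close. The sketch below uses the inverse of that distance as the density estimate and picks the densest point as a center; this particular estimator, the data, and the function name `knn_density` are assumptions for illustration and may differ from the formula used in the paper.

```python
import math

def knn_density(points, k):
    """Density estimate per point: 1 / (distance to its k-th nearest neighbor).
    Assumes no duplicate points, which would make that distance zero."""
    densities = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        densities.append(1.0 / distances[k - 1])
    return densities

# Hypothetical 1-D data written as 1-tuples: three close points and one far away.
points = [(0,), (1,), (2,), (10,)]
densities = knn_density(points, k=2)

center_index = densities.index(max(densities))  # local maximum density point
print([round(d, 3) for d in densities], center_index)
```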

Clustering is the process of classifying objects into different groups by partitioning a dataset into a series of subsets called clusters. Clustering has its roots in algorithms such as k-means and k-medoids. However, the conventional k-medoids algorithm suffers from several limitations. First, it requires prior knowledge of the number of clusters, k. Second, it initially selects k representative objects at random, and if these initial medoids are not chosen well, the natural clusters may not be obtained. Third, it is sensitive to the order of the input dataset. Mining knowledge from large amounts of spatial data is known as spatial data mining. It has become a highly demanding field because huge amounts of spatial data have been collected in applications ranging from geospatial data to biomedical knowledge. A database can be clustered in many ways, depending on the clustering algorithm employed, the parameter settings used, and other factors, and multiple clusterings can be combined so that the final partitioning of the data is better. One study proposes an efficient density-based k-medoids clustering algorithm to overcome the drawbacks of the DBSCAN and k-medoids algorithms. The result is an improved version of k-medoids that is expected to perform better than DBSCAN when handling clusters of circularly distributed data points and slightly overlapping clusters. Clustering is an efficient way of extracting information from raw data, and k-means and k-medoids are basic methods for it; although they are easy to implement and understand, both have serious drawbacks. The proposed clustering and outlier detection system was implemented using Weka and tested on a proteins database created with a Gaussian distribution function, so that the data form circular or spherical clusters in space.
According to the reported tables and graphs, the proposed density-based k-medoids algorithm performed better than DBSCAN and k-medoids in terms of classification quality, as measured by the Rand index. One of the major challenges in the medical domain is the extraction of comprehensible knowledge from medical diagnosis data. The proposed algorithm has considerable scope in application areas such as medical image segmentation and medical data mining. Future work may address the issues involved in applying the algorithm to a particular application area.
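The Rand index used as the quality measure counts how often two labelings agree about a pair of objects, either by placing both in the same group or by separating them. The sketch below is a direct implementation of that definition on made-up labels; `rand_index` is a name chosen for this example.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two labelings agree
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Same grouping with the cluster names swapped: perfect agreement.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# A grouping that cuts across the reference clusters: low agreement.
print(round(rand_index([0, 0, 1, 1], [0, 1, 0, 1]), 3))
```

Because only pair relationships are compared, the index does not depend on how the clusters happen to be numbered.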

## [1]

- Practical implementation of a clustering algorithm.
- Duplicate record assignment activity due to the implementation of two algorithms.
- Two different algorithms are implemented collaboratively for fact finding about student performance. The author used clustering algorithms along with statistical tools to identify the key characteristics and issues of students' educational performance.
- No firm ground is given for selecting the k-means algorithm for this particular type of data, i.e. students' performance data, other than its simplicity.
- The evaluation and analysis of the data is based on multiple datasets, i.e. datasets for 9 subjects.
- Implementing the k-means clustering algorithm along with a statistical analysis tool may reduce the performance of the clustering analysis, and the time taken to apply both algorithms is not mentioned.

## [2]

- Two clustering techniques, k-Means and k-Medoids, are implemented to judge their suitability, particularly for arbitrarily distributed data.
- Cluster shape and distance type (symmetric or asymmetric) are not properly defined as parameters, which undermines the uniformity of the research analysis.
- The two techniques are analyzed on the basis of computational time as the performance measure.
- The behavioral analysis of the algorithms is based on observation rather than on a concrete statistical tool.
- It was concluded that partition-based algorithms are more efficient on spherical clusters.
- No benchmark was set for the computational time analysis, and the result was based on the lower average time consumed in executing the algorithm.
- Although the paper notes several times that many factors affect an algorithm's performance, the testing equipment (i.e. the PC) was not standardized.