K Means Clustering One R Classification Computer Science Essay


Data mining is now a powerful technology with great potential to help organizations concentrate on the most important information in the data they have gathered about the behavior of their clients and potential clients. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.

Clustering is a fundamental technique in data mining. It offers several algorithms for grouping large data sets based on specific parameters and related data points. Classification algorithms predict discrete or nominal values from the other attributes in the dataset and are among the easiest ways to analyze data.

In this paper, we present the basic techniques of data mining and give an overview of clustering and classification, and we apply two data mining algorithms to two datasets that we have chosen. First, we apply the k-means clustering algorithm to the "Facial Palsy" dataset, using only 30 attributes and 30 records from it. We then develop a data mining algorithm based on the OneR rule-based classification algorithm and apply it to the "Poem" dataset. These algorithms have been developed in Java for integration with the Weka machine learning software. The detailed descriptions, datasets and algorithms are discussed in the following sections.



Data mining, sometimes called knowledge discovery, is the process of examining data from different perspectives and summarizing it into valuable information. The techniques used in data mining are usually embodied in precise algorithms. The term was introduced in the 1990s, but it is the evolution of a field with a long history. Data mining is popularly known as Knowledge Discovery in Databases (KDD), but it is actually only one important part of that process.

Data mining is used to predict unknown data. It examines the models and relationships within data sets and databases using data analysis tools such as statistical models, machine learning methods and mathematical algorithms. Several factors explain why data mining is growing quickly at present: data can be collected automatically; the volume of data grows so quickly that older data are never analyzed; hardware and data analysis software are inexpensive and fast; new types of data are appearing, such as spatial and multimedia data, moving from 2D and 3D to hundreds and thousands of dimensions; and computerized analysis of data is cheap, whereas manual analysis is comparatively expensive.

The application of these techniques has been made practical by the increased availability of data, of processing power and of economical storage. Data mining software is one of several analytical tools for evaluating data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Popular techniques of data mining are:

Neural networks

Support vector machines

Decision tree

K-nearest neighbor


Classification etc…

Fig: Data mining flow process

1.1 Applications

Data mining has a wide range of applications in various fields. Sales forecasting, weather forecasting, medical diagnosis and pattern recognition are only a few of its traditional uses. However, owing to the recent explosion of recorded data and the extensive development of the field, data mining capabilities now go far beyond this basic functionality. Data mining has been extended to areas such as fraud detection, spam filtering, targeted advertising campaigns and cyber-crime analysis.

1.2 Weka Machine Learning Tool

Weka is an open-source collection of data mining and machine learning algorithms, written in Java, that covers data preprocessing, classification, clustering and association rule extraction. The Weka machine learning framework contains 49 data preprocessing tools, 76 classification/regression algorithms and 8 clustering algorithms. It also contains 15 attribute/subset evaluators with 10 search algorithms for feature selection. It allows users to load ARFF files, DAT, BSI (binary serialized instances) and other formats. Users can preprocess the data and run classification and clustering with different options. The result can be saved as a model and re-evaluated later. Weka supports a variety of standard data mining tasks, more specifically data preprocessing, clustering, classification, regression, visualization and feature selection. It provides three graphical user interfaces, "Explorer", "Experimenter" and "KnowledgeFlow", as well as the "SimpleCLI" command-line interface. With Weka as a machine learning tool, useful knowledge can be extracted from databases that are too large to analyze manually.

1.3 Purpose of this paper

The ultimate objective is to apply two different algorithms to two different data sets, evaluate them using several measures such as accuracy, precision, recall, specificity, sensitivity and ROC area, and on that basis recommend the algorithm that gives the best result for each data set. The following sections present the data sets and the methods used, evaluate them, and recommend the best algorithm.
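Accuracy, precision, recall (sensitivity) and specificity are simple ratios over the four cells of a binary confusion matrix. The following is a minimal sketch of those ratios, not the evaluation code used in this essay; the class name `MetricsSketch` and the counts in `main` are made up for illustration.

```java
// Hypothetical helper: evaluation measures computed from a binary confusion matrix.
// tp/fn/fp/tn = true positives, false negatives, false positives, true negatives.
class MetricsSketch {
    // Fraction of all instances that were classified correctly.
    static double accuracy(int tp, int fn, int fp, int tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }

    // Fraction of predicted positives that are truly positive.
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    // Recall, also called sensitivity: fraction of true positives that were found.
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Fraction of true negatives that were correctly rejected.
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    public static void main(String[] args) {
        int tp = 8, fn = 2, fp = 1, tn = 9; // made-up counts, for illustration only
        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f specificity=%.3f%n",
                accuracy(tp, fn, fp, tn), precision(tp, fp), recall(tp, fn), specificity(tn, fp));
    }
}
```

ROC area is not covered here because it needs ranked prediction scores rather than a single confusion matrix.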

2.0 Clustering Algorithms

Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. In data mining, it is a machine learning technique used to place data elements into related groups without advance knowledge of the group definitions. It can also be viewed as a form of unsupervised learning. Cluster analysis is an important human activity. Early in childhood, we learn to distinguish between cats and dogs, or between animals and plants, by continually and subconsciously improving our clustering schemes. With automated clustering, we can identify dense and sparse regions in object space and thereby discover overall distribution patterns and interesting correlations among data attributes.

Aggarwal et al. define the clustering process as follows: "Given a set of points in multidimensional space, find a partition of the points into clusters so that the points within each cluster are close to one another" (C. Aggarwal, J. Wolf, P. Yu, C. Procopiuc, and J. Park, 1999). Closeness is measured with a metric specific to each algorithm; for example, the closer two points are, the more strongly they are connected. This process yields groupings in which the relations between points within a cluster (intra-cluster) are strong and the relations between points in different clusters (inter-cluster) are weak, as shown in the following figure.

Figure 1: Example for the representation of clustering point

Cluster analysis is used in many applications, for example market research, pattern recognition, data analysis and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, to categorize genes with similar functionality, and to gain insight into the structures inherent in populations. Clustering may also help identify areas of similar land use in an earth observation database, and groups of houses in a city according to house type, value and geographical location. Clustering can also be used for outlier detection, with applications such as credit card fraud detection and the monitoring of criminal activities in electronic commerce.

3.0 Classification Algorithms

Classification is one of the problems most frequently studied by machine learning and data mining researchers. It consists of predicting the value of one attribute based on the values of other attributes. Several classification approaches are available, such as:

Statistical classification is a procedure in which individual items are placed into groups based on quantitative information about characteristics inherent in the items, using a training set of previously labeled items. Some examples of statistical algorithms are KNN, kernel methods and linear discriminant analysis.

A decision tree is a set of conditions organized in a hierarchical structure. It is a predictive model in which an instance is classified by following the path of satisfied conditions until a leaf node is reached, which corresponds to a class label. A decision tree can be converted into a set of classification rules. One of the best-known algorithms is J48.

Rule induction produces IF-THEN rules that are extracted from a set of observations. The algorithms in this family may use supervised machine learning or a heuristic search. Some of the best-known rule induction algorithms are CN2, Apriori and genetic algorithms that use real-valued genes.

Fuzzy rule induction applies fuzzy logic in order to interpret the underlying data linguistically. To describe a fuzzy system completely, the structure of the fuzzy rules and the parameters of the partitions must be determined for all variables. Examples of fuzzy rule algorithms include AdaBoost and grammar-based genetic programming. Neural networks can also be used for rule induction. A neural network is also called a parallel distributed processing network.

A neural network is a computing paradigm loosely modeled on the structure of the cerebral cortex. It consists of interconnected nodes that work together to produce an output function. Some examples are the radial basis function neural network and the multilayer perceptron.

3.1 Training Phase

In the training phase, a predictive model is learned from a data set (the training set) whose instances each consist of a group of attributes together with the class label assigned to them.

3.2 Prediction Phase

In the prediction phase, the learned model is used to predict the class label of unseen instances by analyzing their attribute values. Some algorithms can only predict a binary split, some can predict one of N classes, and some also give a probability for each of the N classes.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it.

3.3 Popular Classification Techniques

Decision Tree based Methods

Rule-based Methods

Neural Networks

Memory based reasoning

Support Vector Machines

3.4 The Usage

Classification is a form of data analysis that can be used to extract models describing important data classes. Such analysis can help give us a better understanding of the data at large. Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict how many dollars potential customers will spend on computer equipment, given their income and occupation.

A bank loans officer needs analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank.

A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer.

A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive.

In each of these examples, the data analysis task is classification, where a model is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or "treatment A," "treatment B," or "treatment C" for the medical data.

These categories can be represented by discrete values, where the ordering among the values has no meaning.

4.0 Dataset Overview

This section gives an overview of the two datasets that we have chosen for the two data mining algorithms. First, the k-means clustering algorithm is applied to the "Facial Palsy" dataset, of which we use only 30 attributes and 30 records. Then the data mining algorithm that we developed, based on the OneR rule-based classification algorithm, is applied to the "Poem" dataset.

4.1 Facial Palsy

This data set contains 66 principal components, given as 66 attributes, and 1 class attribute that distinguishes the examples, which were produced from 50x50 example images. One class value represents a face pattern recognized from a person with severe facial palsy. The class value -1 represents a face pattern recognized from a normal person without facial nerve paralysis.

4.2 Poem

This data set contains 76 data instances of poems; each instance has 1410 attribute values and 1 class attribute that represents the type of the document. The class value +1 represents a depressing poem and the class value -1 represents an interesting poem.

5.0 Methodology

In this report, we focus on describing two techniques, one from clustering and one from classification.

5.1 K-Means Methods

Clustering algorithms are generally used in an unsupervised fashion. They are presented with a set of data instances that must be grouped according to some notion of similarity. The goal of clustering is to assign observations to groups so that the observations within each group are more similar to one another, with respect to the attributes of interest, than to observations belonging to other groups. The algorithm has access only to the set of features describing each object; it is not given any information about which cluster each instance should belong to. However, in real application domains, it is often the case that the experimenter has some background knowledge about the domain or the data set that could be useful in clustering. Traditional clustering algorithms have no way of using this information even when it does exist. We are therefore interested in developing clustering algorithms.

There are many clustering methods, such as partitioning methods, hierarchical methods, grid-based methods, model-based methods, and user-guided or constraint-based methods. We chose a non-hierarchical clustering method, the so-called k-means method, which is simple in form. K-means clustering is a machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. It is a partitioning method that always produces K clusters, where "K" stands for the number of clusters and is supplied by the user as input to the algorithm. The algorithm tends to produce globular clusters. The clusters have no hierarchy and do not overlap. The k-means method is statistical, unsupervised and iterative. It follows these steps.

1. Arbitrarily select k points as the initial cluster centers.

2. Decide the class memberships of the objects by assigning them to the nearest cluster center.

3. Compute the centroids of the clusters of the current partition by assuming the memberships found above are correct. The centroid of cluster j is the mean of its members, $c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$, where $S_j$ is the set of objects currently assigned to cluster j.

4. Repeat steps 2 and 3 until no object changed membership in the last iteration.
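The four steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation tested in this essay: the class name `KMeansSketch`, the use of squared Euclidean distance, the choice of the first k points as initial centers (so that the run is repeatable) and the sample points are all assumptions made for the example.

```java
import java.util.Arrays;

// Minimal k-means sketch; names and sample data are hypothetical.
class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // Returns the cluster index assigned to each point.
    static int[] cluster(double[][] pts, int k, int maxIter) {
        int n = pts.length, dim = pts[0].length;
        // Step 1: initial centers (assumption: simply the first k points).
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) centers[j] = pts[j].clone();
        int[] label = new int[n];
        Arrays.fill(label, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // Step 2: assign every point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(pts[i], centers[j]) < dist2(pts[i], centers[best])) best = j;
                if (best != label[i]) { label[i] = best; changed = true; }
            }
            // Step 4: stop once no point changed membership.
            if (!changed) break;
            // Step 3: recompute each centroid as the mean of its members.
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[label[i]]++;
                for (int d = 0; d < dim; d++) sum[label[i]][d] += pts[i][d];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0) // an empty cluster keeps its previous center
                    for (int d = 0; d < dim; d++) centers[j][d] = sum[j][d] / count[j];
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] pts = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9.5}, {1, 0.5}, {8.5, 9}};
        System.out.println(Arrays.toString(cluster(pts, 2, 100)));
    }
}
```

On the six made-up points in `main`, the sketch separates the three points near (1, 1) from the three points near (8.5, 9). A different initialization can give a different partition on less clearly separated data.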

The K means method is commonly used in medical imaging, biometrics and related fields.


Fig: The step of the k-mean algorithm


Advantages:

With a large number of variables, K-Means may be computationally faster than hierarchical clustering if K is small.

K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.


Disadvantages:

Does not work well with non-globular clusters.

The number of clusters is fixed in advance, and it can be difficult to predict what K should be.

Unable to handle noisy data and outliers.

Not suitable for discovering clusters with non-convex shapes.

5.2 Rule-based Methods

Each rule-based classification approach uses an algorithm to generate rules from the training data. Those rules are then applied to the test instances in the test dataset. Rule-based algorithms provide mechanisms that generate rules by concentrating on one specific class at a time and maximizing the probability of the desired classification.

One R Algorithm

The OneR algorithm was developed by Robert Holte. It is also called the one-attribute rule. Each rule has a condition part that tests a single attribute, which often works well with real-world data. The OneR algorithm must find the one attribute to use for classification that causes the fewest prediction errors, because a low error rate means a higher accuracy. It is a very simple classification rule and produces a one-level decision tree that tests a single attribute.

For example, to classify watches we might use the following rule.

If the dial is big, then it is a sports watch. A conditional rule with several attributes would be: if it has a big dial, is waterproof and has a stopwatch, then it is a sports watch.

The pseudocode of the algorithm can be described as:

For each attribute,

    For each value of that attribute, create a rule as follows:

        1. Count how often each class appears.

        2. Find the most frequent class.

        3. Make a rule that assigns that class to this attribute value.

    Calculate the error rate of these rules.

Pick the attribute whose rules produce the lowest error rate.
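The pseudocode above can be sketched in Java for nominal attributes as follows. This is an illustrative sketch, not the implementation developed for this essay: the class `OneRSketch`, its `Rule` type and the small watch table are hypothetical, and ties between classes are broken alphabetically as an arbitrary choice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Minimal OneR sketch for nominal data; all names and data are hypothetical.
class OneRSketch {
    // The rules built on one attribute, with their total training errors.
    static class Rule {
        int attribute;                                      // column the rules test
        Map<String, String> valueToClass = new HashMap<>(); // attribute value -> predicted class
        int errors;                                         // training instances misclassified
    }

    // Made-up data. Columns: dial, waterproof, class.
    static final String[][] WATCHES = {
        {"big", "yes", "sport"}, {"big", "no", "sport"}, {"big", "yes", "sport"},
        {"small", "yes", "dress"}, {"small", "no", "dress"}, {"small", "yes", "sport"}
    };

    // rows[i] holds nominal attribute values; the last column is the class.
    static Rule train(String[][] rows) {
        int classCol = rows[0].length - 1;
        Rule best = null;
        for (int a = 0; a < classCol; a++) {
            // 1. Count how often each class appears for each attribute value.
            Map<String, Map<String, Integer>> counts = new TreeMap<>();
            for (String[] r : rows)
                counts.computeIfAbsent(r[a], v -> new TreeMap<>()).merge(r[classCol], 1, Integer::sum);
            Rule rule = new Rule();
            rule.attribute = a;
            for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
                // 2. Find the most frequent class (ties: alphabetically first).
                String majority = null;
                int max = -1, total = 0;
                for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                    total += c.getValue();
                    if (c.getValue() > max) { max = c.getValue(); majority = c.getKey(); }
                }
                // 3. Make a rule that assigns that class to this attribute value.
                rule.valueToClass.put(e.getKey(), majority);
                rule.errors += total - max; // instances not in the majority class
            }
            // Keep the attribute whose rules make the fewest errors.
            if (best == null || rule.errors < best.errors) best = rule;
        }
        return best;
    }

    public static void main(String[] args) {
        Rule r = train(WATCHES);
        System.out.println("attribute " + r.attribute + " -> " + r.valueToClass
                + " (" + r.errors + " errors)");
    }
}
```

Because every attribute is evaluated on the same training set, comparing raw error counts is equivalent to comparing error rates.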

Suppose that, in the weather data, you wish to predict the value of play. The idea of the OneR (one-attribute rule) algorithm is to find the one attribute to use that causes the fewest prediction errors.

For example, consider outlook:

if outlook = sunny then play = no, which makes 2 errors in 5 records

if outlook = overcast then play = yes, which makes 0 errors in 4 records

if outlook = rainy then play = yes, which makes 2 errors in 5 records

for a total of 4 errors in 14 cases. Similarly,

if humidity = high then play = no, which makes 3 errors in 7 records

if humidity = normal then play = yes, which makes 1 error in 7 records

for a total of 4 errors in 14 cases. The other two attributes each produce 5 errors at best. Outlook and humidity are therefore tied, and the OneR algorithm chooses one of them as the single decisive attribute.
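The error counts quoted above can be checked with a short program. The sketch below assumes the standard 14-record nominal weather data distributed with Weka, which is consistent with the figures in the text; the table is reproduced here from that assumption, and the class name `WeatherOneRCheck` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Counts OneR training errors per attribute on the (assumed) nominal weather data.
class WeatherOneRCheck {
    static final String[] NAMES = {"outlook", "temperature", "humidity", "windy"};

    // Columns: outlook, temperature, humidity, windy, play (class).
    static final String[][] WEATHER = {
        {"sunny", "hot", "high", "false", "no"},
        {"sunny", "hot", "high", "true", "no"},
        {"overcast", "hot", "high", "false", "yes"},
        {"rainy", "mild", "high", "false", "yes"},
        {"rainy", "cool", "normal", "false", "yes"},
        {"rainy", "cool", "normal", "true", "no"},
        {"overcast", "cool", "normal", "true", "yes"},
        {"sunny", "mild", "high", "false", "no"},
        {"sunny", "cool", "normal", "false", "yes"},
        {"rainy", "mild", "normal", "false", "yes"},
        {"sunny", "mild", "normal", "true", "yes"},
        {"overcast", "mild", "high", "true", "yes"},
        {"overcast", "hot", "normal", "false", "yes"},
        {"rainy", "mild", "high", "true", "no"}
    };

    // Total errors when each value of attribute a predicts its majority class.
    static int errors(String[][] rows, int a) {
        int classCol = rows[0].length - 1;
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] r : rows)
            counts.computeIfAbsent(r[a], v -> new HashMap<>()).merge(r[classCol], 1, Integer::sum);
        int errors = 0;
        for (Map<String, Integer> byClass : counts.values()) {
            int total = 0, max = 0;
            for (int c : byClass.values()) { total += c; max = Math.max(max, c); }
            errors += total - max; // everything outside the majority class is an error
        }
        return errors;
    }

    public static void main(String[] args) {
        for (int a = 0; a < NAMES.length; a++)
            System.out.println(NAMES[a] + ": " + errors(WEATHER, a) + " errors in " + WEATHER.length);
    }
}
```

Under that assumption, outlook and humidity each give 4 errors and temperature and windy each give 5, matching the totals in the text.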

The 1R Algorithm Implementation Steps

1R takes as input a set of examples, each with several attributes and a class. Its goal is to infer a rule that predicts the class given the attribute values. The 1R algorithm selects the single most informative attribute and bases the rule on this attribute alone.

The algorithm for dividing a continuous range into discrete partitions is as follows:

1) Sort the tuples by attribute value.

2) Place a break point between every pair of different values.

3) Repeat

a. Delete each break point that lies between intervals predicting the same class,

b. Examine the decrease in accuracy that results from removing each break point,

4) Choose the partition with the best balance between accuracy and the number of break points.
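Steps 1, 2 and 3a of this procedure can be sketched in Java as follows. This is a simplified illustration only: it does not implement the accuracy checks of steps 3b and 4, and the class `OneRDiscretizeSketch`, the midpoint placement of break points and the sample values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified discretization sketch (steps 1, 2 and 3a only); names are hypothetical.
class OneRDiscretizeSketch {
    // A run of sorted values together with the class counts observed in it.
    static class Interval {
        double low, high;
        Map<String, Integer> counts = new TreeMap<>();

        // Most frequent class; ties go to the alphabetically first class.
        String majority() {
            String best = null;
            int max = -1;
            for (Map.Entry<String, Integer> e : counts.entrySet())
                if (e.getValue() > max) { max = e.getValue(); best = e.getKey(); }
            return best;
        }
    }

    static List<Double> breakPoints(double[] values, String[] classes) {
        // Step 1: sort the record indices by attribute value.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        // Step 2: one interval per distinct value, i.e. a break between every pair.
        List<Interval> intervals = new ArrayList<>();
        for (int idx : order) {
            Interval last = intervals.isEmpty() ? null : intervals.get(intervals.size() - 1);
            if (last == null || last.high != values[idx]) {
                last = new Interval();
                last.low = last.high = values[idx];
                intervals.add(last);
            }
            last.counts.merge(classes[idx], 1, Integer::sum);
        }

        // Step 3a: remove break points between neighbours predicting the same class.
        List<Interval> merged = new ArrayList<>();
        for (Interval iv : intervals) {
            Interval prev = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (prev != null && prev.majority().equals(iv.majority())) {
                prev.high = iv.high;
                for (Map.Entry<String, Integer> e : iv.counts.entrySet())
                    prev.counts.merge(e.getKey(), e.getValue(), Integer::sum);
            } else {
                merged.add(iv);
            }
        }

        // The surviving break points, placed midway between neighbouring intervals.
        List<Double> breaks = new ArrayList<>();
        for (int i = 0; i + 1 < merged.size(); i++)
            breaks.add((merged.get(i).high + merged.get(i + 1).low) / 2);
        return breaks;
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3, 4, 5, 6};
        String[] c = {"no", "no", "yes", "yes", "no", "no"};
        System.out.println(breakPoints(v, c));
    }
}
```

On noisy data this simplified merge can leave many small intervals, which is the situation that the accuracy-based steps 3b and 4 are meant to address.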

Pros: Rule-based classification is very accurate for small document lists. The results always follow what you define, because you write the rules; it is therefore simple to implement and easy for humans to understand. New test instances can be classified efficiently, and the dataset can be explained easily. In addition, the performance of the OneR algorithm is comparable to that of decision trees.

Cons: Defining rules can be tedious for large document collections with many categories. As your document collection grows, you may need to write correspondingly more rules.

6.0 Analysis

In this analysis, both the K-Means and the OneR algorithms were applied, alongside their Weka counterparts, to data sets from two specific domains in order to evaluate their performance. K-Means was tested on the Facial Palsy dataset and OneR on the Poem dataset. For the K-Means test we used only 30 attributes and 30 records from the "Facial Palsy" dataset. The goal of these tests is to find the most suitable method for each data set.

6.1 Comparison of K-Means on "KMeans_Sample" data set (ref: Facial Palsy)

For the "Facial Palsy" dataset, we ran the tests shown below with the default configuration. By comparing all the test results, we decide which method is the most appropriate for this data set.

Test Result from implemented our K-Mean Algorithm




Weka Result


7.0 Conclusion

In brief, we have tested and evaluated the approaches in our experiment, and we have learned a great deal about machine learning with Weka. We learned that classification is a simple but very effective method in data mining, and that the possibilities for using it in real-world business environments are endless. If we had another opportunity to do data mining research, we would want to extend the clustering experiment that we did. Data mining has become a very useful technology for extracting knowledge from massive data, but it does not always yield the knowledge that is needed. We need a clear understanding of the data and of what the experimental results suggest. Without a full understanding of the algorithms and methods, we may be misled and end up with wrong assumptions. A good data mining experiment therefore needs input data that is clean, of high quality and unbiased. Choosing a suitable mining algorithm and validating the results it produces through sound testing are also very important. Finally, we must apply the knowledge we have gained effectively in order to achieve our goals; otherwise it is of no use.

In addition, choosing the correct technique for each data set is very important for obtaining accurate output. The experiment also showed clearly that, for a small data set with a percentage split, the prediction results were consistently high because each algorithm predicted accurately. Overall, the different classification techniques showed different performance characteristics and levels of accuracy, and their results varied in effectiveness.

The main purpose of this dissertation was to study the development of the K-Means and OneR algorithms. When I have another opportunity, I will continue this work with small, medium and large data sets in order to compare the algorithms and their test results effectively. In this study I was happy to describe how my code works in practice. I gained experience in testing with Weka and came to understand the concepts behind the K-Means and OneR algorithms. K-means clustering is one of the best candidates for cluster computing, which allows processing that a single large computer or manual data handling cannot perform optimally. I would like to thank the university for supporting the investigation for this report and for the opportunity to undertake this work, and I also thank Mr. Insu Song and Ms. Yeli Feng for their motivation and helpful comments.

7.1 Future Work

It is predicted that data mining concepts will be embedded into the core modules of databases, as has happened over the past several years with Oracle Data Mining from Oracle and the business intelligence solution in SQL Server 2008 from Microsoft. However, stand-alone applications will remain useful for specialist data miners. More algorithms will be invented to meet current business needs and technical trends.