Advanced Data Mining and Knowledge Discovery



Data mining is a relatively new and contemporary computing theme, although its statistical techniques draw on much older work in machine learning and pattern recognition. As a relatively recent innovation that improves on traditional data analysis, it is seeing rapid expansion. In today's data-rich climate, conveying useful information efficiently and accurately remains a key requirement, and a major topic of ongoing research.

Clustering is one of the basic techniques provided by data mining tools across a range of applications. It offers several algorithms that evaluate large data sets and group related data points according to specific parameters.

In addition, association rule mining is one of the techniques used by a range of data mining tools. It provides a number of algorithms that can evaluate small, medium and large input data sets.

This report covers the implementation of two widely used algorithms: K-means (clustering) and Apriori (association). Each technique was first tested with a standard manual calculation in Excel; I then implemented each algorithm in the Java language. I also used Weka to compare the results.

Further experiments and testing suggested possible ways to improve these methods. Application-specific aspects of clustering and association rule mining are discussed, along with several suggestions for future work.


Data mining searches large databases for valuable business information, much as its name suggests: mining gigabytes of stored data (collected from scanners and other product sources) for nuggets of value (D. Alexander, 1996). It involves one of two processes: combing through a huge amount of material, or intelligently probing to find exactly where the value lies.

Data Mining

Data mining can be considered the software and application technology that turns data into information, delivering on the promise of the data warehouse. It has been defined as "the non-trivial extraction of implicit, previously unknown and potentially useful information from data" (W. Frawley, G. Piatetsky-Shapiro and C. Matheus, 1992) and as "the science of extracting useful information from large data sets or databases" (D. Hand, H. Mannila and P. Smyth). Often grouped with data analysis techniques such as artificial intelligence, data mining is a general term that carries different meanings for people from a wide range of backgrounds. It is sometimes called knowledge discovery in databases (KDD), and it provides tools with which users can scour through data and uncover anomalies. These findings may highlight previously undetected relationships, influence strategic decisions and identify new hypotheses that require further investigation.

Data warehouse users may feel that they are "data rich, information poor", or "flooded with data but lacking information". This phenomenon is quite common in the real world. The challenge, therefore, is to turn data into information, and then to act on that information. In today's competitive business environment, enterprises need to turn their tremendous data stores into insight rapidly. Appropriate tools (such as RapidMiner, Weka, etc.) are one answer to this data mining challenge.


Aggarwal et al. defined clustering as the following process: "Taking into account a number of points in multidimensional space, find a partition of points into clusters so that points in each cluster are close to one another" (C. Aggarwal, J. Wolf, P. Yu, C. Procopiuc, and J. Park, 1999). Closeness is measured using various algorithm-specific metrics; for example, the closer any two points are to one another, the more strongly they are connected. The process results in groupings of units in which relations among points are strong within each cluster and weak between clusters, as illustrated in Figure 1.

Figure 1: Example of a clustered-points representation

Clustering can be described in machine learning terms as a form of unsupervised learning; that is, the clusters are representative of patterns hidden in the source data set (P. Berkhin). The algorithm analyses raw information and finds ties without external direction or interference, learning through observation rather than through labelled examples. This makes clustering an objective and effective means of data analysis, free from subjective human conclusions being imposed on the data.

Partitioning clustering:

Partitioning methods define clusters by grouping data points into k partitions, where k is specified by the user at the time the process is carried out. A point is judged to be similar to the other points within its partition, and dissimilar to the points that lie outside the boundary of that partition (J. Han and M. Kamber, 2006). Comparison is based on the features of the data provided: the algorithms convert the semantic attributes of each point (width, height, shape, colour, and so on) into coordinates on a number of mathematical axes. This provides an objective and computationally tractable framework for analysis. In the simplest case there are only two attributes, so each point maps onto the standard Cartesian plane. The process is greatly complicated when, as often occurs with high-dimensional sources, hundreds of attributes are present: the plane gives way to a space of high dimensionality, and the analysis becomes very computationally expensive.

Centroid clustering:

The centroid method measures the distance between two clusters. Suppose two clusters are to be formed from the observations listed in Table 1. We begin by arbitrarily assigning a, b and d to cluster 1, and c and e to cluster 2. The cluster centroids are calculated as shown in Figure 2 (part a). A cluster centroid is the point whose coordinates equal the average values of the variables over the cluster's observations. Thus, the centroid of cluster 1 is the point (X1 = 3.67, X2 = 3.67), and that of cluster 2 the point (8.75, 2). The two centroids are marked C1 and C2 in Figure 2. A cluster's centroid can therefore be considered the centre of the cluster's observations, as shown in Figure 2 (part b).

Figure 2: Centroid representation

Table 1: Sample Data Illustration for Centroid Explanation
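The centroid calculation can be sketched in Java as a per-coordinate mean. This is a minimal illustration: the class name and the sample points are invented for the example (they are not the Table 1 data), though they are chosen so the mean works out to (3.67, 3.67), as for cluster 1 above.

```java
// Minimal sketch: the centroid of a cluster is the per-coordinate mean
// of its member points. The points used here are illustrative only.
public class Centroid {

    // Returns the centroid (mean x, mean y) of the given 2-D points.
    static double[] centroid(double[][] points) {
        double sumX = 0, sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] { sumX / points.length, sumY / points.length };
    }

    public static void main(String[] args) {
        // Three hypothetical points whose mean works out to (3.67, 3.67).
        double[][] cluster1 = { {2, 3}, {4, 5}, {5, 3} };
        double[] c = centroid(cluster1);
        System.out.printf("centroid = (%.2f, %.2f)%n", c[0], c[1]);
    }
}
```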

Association Rule

Association rule mining finds interesting associations and/or correlations among large sets of data items. Association rules show attribute-value conditions that occur together frequently in a given dataset. A typical and widely used example of association rule mining is market basket analysis.
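The two measures underlying such rules, support and confidence, can be sketched in Java. The class name and the toy basket database below are invented purely for illustration; support is the fraction of transactions containing an itemset, and the confidence of a rule X => Y is support(X ∪ Y) / support(X).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the two measures behind association rules, using a toy
// market-basket database (the item names are invented for illustration).
public class RuleMeasures {

    // Four hypothetical shopping baskets.
    static List<Set<String>> sampleDb() {
        return Arrays.asList(
            new HashSet<>(Arrays.asList("bread", "milk")),
            new HashSet<>(Arrays.asList("bread", "butter")),
            new HashSet<>(Arrays.asList("bread", "milk", "butter")),
            new HashSet<>(Arrays.asList("milk")));
    }

    // Support of an itemset: the fraction of transactions containing it.
    static double support(List<Set<String>> db, Set<String> items) {
        long hits = db.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / db.size();
    }

    // Confidence of the rule X => Y: support(X union Y) / support(X).
    static double confidence(List<Set<String>> db, Set<String> x, Set<String> y) {
        Set<String> both = new HashSet<>(x);
        both.addAll(y);
        return support(db, both) / support(db, x);
    }

    public static void main(String[] args) {
        Set<String> bread = Collections.singleton("bread");
        Set<String> milk = Collections.singleton("milk");
        System.out.println("support(bread)      = " + support(sampleDb(), bread));
        System.out.println("conf(bread => milk) = " + confidence(sampleDb(), bread, milk));
    }
}
```

In this toy database, bread appears in 3 of 4 baskets (support 0.75), and 2 of the 3 bread baskets also contain milk, so the rule bread => milk has confidence 2/3.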

Association rule mining is one of the most important procedures in data mining. In industrial applications, often more than 10,000 rules are found. To allow manual inspection and support knowledge discovery, the number of rules must be substantially reduced by techniques such as pruning or grouping. One approach introduces a normalized distance metric for grouping association rules: based on these distances, an agglomerative clustering algorithm is used to cluster the rules; the rules are also embedded in a vector space by multi-dimensional scaling and clustered using a self-organizing feature map, and the results are combined for visualization. Different distance measures can then be compared in terms of the subjective and objective cluster purity obtained on real data.

Figure 3: Hierarchical association rule example using computer items

Related Work

K-means and Apriori are discussed in many textbooks. There are numerous applications of K-means clustering, ranging from unsupervised learning in neural networks to pattern recognition, classification analysis, artificial intelligence, image processing and machine vision. In principle, whenever you have several objects, each with a number of attributes, and you wish to classify the objects based on those attributes, you can apply this algorithm. (Wikipedia)

Similarly, the Apriori association rule algorithm has been used for diabetic data analysis, for ranking the importance of attributes in health databases, alongside logistic regression, and so on. For efficiency, it filters candidates based on support before association rules are mined.

I developed a Java program for K-means cluster centroids and the Apriori association rule. (1) The K-means application reports the number of iterations, all the attributes, and the distance of each point from cluster 1 and cluster 2. (2) The Apriori application reports the most frequent itemset and the association rules between nodes.

Data Sets

K-Mean Dataset:

Figure 4: Ionosphere dataset information

Figure 5: Ionosphere Dataset Sample Data

Apriori Dataset:

Figure 6: Sample dataset used in the lecture


K-Mean Cluster

The goal of cluster analysis is to assign observations to groups (clusters) so that the observations within each group are similar to one another with respect to the variables or attributes of interest, while the groups themselves stand apart from one another. In other words, the aim is to divide the observations into homogeneous and distinct groups.

There are many other clustering methods. For example, one hierarchical method proceeds in divisive fashion: it begins with a single cluster consisting of all observations, next forms 2, 3 and so on clusters, and ends with as many clusters as there are observations. It is not our intention to examine all the clustering methods here (Everitt, 1993). We describe just one non-hierarchical clustering method, the so-called k-means method. In its simplest form, the K-means method follows these steps:

Step 1: Specify the number of clusters and, arbitrarily or intentionally, the members of each cluster.

Step 2: Calculate each cluster's centroid (explained below), and the distance between each observation and each centroid. If an observation is nearer the centroid of a cluster other than the one to which it currently belongs, reassign it to the nearer cluster.

Step 3: Repeat Step 2 until all observations are nearest the centroid of the cluster to which they belong.

Step 4: If the number of clusters cannot be specified with confidence in advance, repeat Steps 1 to 3 with different numbers of clusters and evaluate the results.
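Steps 1 to 3 can be sketched as a short Java routine. This is a minimal, illustrative version for 2-D points: the class name, the hard-coded points and the arbitrary starting assignment are my own choices for the sketch, not the program described later in this report.

```java
import java.util.Arrays;

// Minimal 2-D K-means sketch following Steps 1-3: assign each point to
// its nearest centroid, recompute the centroids, and repeat until stable.
public class KMeansSketch {

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Returns the final cluster index of every point for k clusters.
    static int[] cluster(double[][] pts, int k) {
        int[] assign = new int[pts.length];
        for (int i = 0; i < pts.length; i++) assign[i] = i % k; // arbitrary start (Step 1)
        boolean changed = true;
        while (changed) {
            // Step 2a: recompute each cluster's centroid.
            double[][] cent = new double[k][2];
            int[] size = new int[k];
            for (int i = 0; i < pts.length; i++) {
                cent[assign[i]][0] += pts[i][0];
                cent[assign[i]][1] += pts[i][1];
                size[assign[i]]++;
            }
            for (int c = 0; c < k; c++)
                if (size[c] > 0) { cent[c][0] /= size[c]; cent[c][1] /= size[c]; }
            // Step 2b: reassign each point to its nearest centroid (Step 3 repeats this).
            changed = false;
            for (int i = 0; i < pts.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(pts[i], cent[c]) < dist(pts[i], cent[best])) best = c;
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        // Two obvious groups of three points each.
        double[][] pts = { {1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8} };
        System.out.println(Arrays.toString(cluster(pts, 2)));
    }
}
```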

Suppose two clusters are to be formed from the observations listed in Table 1. We begin by arbitrarily assigning a, b and d to cluster 1, and c and e to cluster 2. The cluster centroids are calculated as shown in Figure 2 (a): a cluster centroid is the point whose coordinates equal the average values of the variables over the cluster's observations. Thus, the centroid of cluster 1 is the point (X1 = 3.67, X2 = 3.67), and that of cluster 2 the point (8.75, 2). The two centroids are marked C1 and C2 in Figure 2 (a), and each can be considered the centre of its cluster's observations, as shown in Figure 2 (b). We now calculate the distance between a and the two centroids:

Equation 1: Calculate distance between a and two centroids as per Table 1

It follows that a is closer to the centroid of cluster 1, the cluster to which it is currently assigned, so it stays where it is. Next, we calculate the distance between b and the two cluster centroids:

Equation 2: Distance between b and two cluster centroids

Since b is closer to cluster 2's centroid than to cluster 1's, it is reassigned to cluster 2. The new cluster centroids are calculated as shown in Figure 2 (a) and plotted in Figure 2 (b). The distances of the observations to the new cluster centroids are as follows (an asterisk indicates the nearest centroid):

Table 2: Distance Table from sample

Each observation now belongs to the cluster whose centroid it is nearest, and the K-means method stops.

What are the weaknesses of K-Mean Clustering?

Like similar algorithms, K-means clustering has many weaknesses:

When the number of data points is small, the initial grouping determines the resulting clusters significantly.

The number of clusters, K, must be decided beforehand.

We never know the real clusters: with the same data, input presented in different orders can produce different clusters when the number of data points is small.

We never know which feature contributes more to the grouping process, because we assume that each attribute has the same weight. One way to mitigate these obstacles is to use K-means clustering only when plenty of data is available.

Apriori Association

Apriori-T (Apriori Total) is an Association Rule Mining (ARM) algorithm, developed by the LUCS-KDD research team, which uses a "reverse" set-enumeration tree in which each level of the tree is defined in terms of an array (i.e. the T-tree data structure is a form of compressed set-enumeration tree). The Apriori-T algorithm was, in fact, developed as part of a more complex ARM algorithm, Apriori-TFP (Apriori Total From Partial). Both algorithms are described in Coenen and Leng (2004).

Figure 7: Apriori Database representation

Consider a database, D, consisting of 9 transactions. Assume the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%) and the minimum confidence required is 70%. We must first find the frequent itemsets using the Apriori algorithm; association rules will then be generated from them using the minimum support and minimum confidence.

Figure 8: 1-itemset frequent Pattern

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets that satisfy the minimum support. In the first iteration of the algorithm, each item is a member of the candidate set.
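This first pass can be sketched in Java. The transactions below are the classic 9-transaction textbook database, which is consistent with the candidate and pruning results reported in the following figures; the actual contents of Figure 7 are an assumption here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the first Apriori pass: count each item's support and keep
// those meeting the minimum support count.
public class OneItemsets {

    // The classic 9-transaction textbook database, assumed to match Figure 7.
    static List<List<String>> sampleDb() {
        return Arrays.asList(
            Arrays.asList("I1", "I2", "I5"), Arrays.asList("I2", "I4"),
            Arrays.asList("I2", "I3"),       Arrays.asList("I1", "I2", "I4"),
            Arrays.asList("I1", "I3"),       Arrays.asList("I2", "I3"),
            Arrays.asList("I1", "I3"),       Arrays.asList("I1", "I2", "I3", "I5"),
            Arrays.asList("I1", "I2", "I3"));
    }

    // Counts each item's support and keeps those with count >= minSup.
    static Map<String, Integer> frequent1(List<List<String>> db, int minSup) {
        Map<String, Integer> count = new TreeMap<>();
        for (List<String> t : db)
            for (String item : t)
                count.merge(item, 1, Integer::sum);
        count.values().removeIf(c -> c < minSup); // prune infrequent items
        return count;
    }

    public static void main(String[] args) {
        // With min_sup = 2, every item survives, as in the worked example.
        System.out.println(frequent1(sampleDb(), 2));
    }
}
```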

Figure 9: 2-itemset frequent pattern

To identify the frequent 2-itemsets, L2, the algorithm joins L1 with L1 to generate the candidate 2-itemsets, C2. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated (as shown in the middle table). The set of frequent 2-itemsets, L2, is then determined: it consists of the candidate 2-itemsets in C2 that have minimum support.

Figure 10: 3-itemset frequent pattern

The generation of the candidate 3-itemsets, C3, uses the Apriori property. To find C3, we compute L2 Join L2: C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. With the join step complete, the prune step is now used to reduce the size of C3. The prune step helps avoid the heavy computation caused by a large Ck.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot be frequent. For example, take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3. Take another example, {I2, I3, I5}, which shows how the pruning is done. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2, so the candidate is not frequent, in violation of the Apriori property, and we remove {I2, I3, I5} from C3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after all members of the join result have been checked for pruning. Now the transactions in D are scanned to determine L3, which consists of the candidate 3-itemsets in C3 that have minimum support.
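The prune step just described can be sketched as a subset check in Java. The class name is invented, and the L2 contents below are those implied by the worked example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Apriori prune step: a candidate k-itemset is dropped
// if any of its (k-1)-subsets is not frequent.
public class PruneStep {

    // The frequent 2-itemsets implied by the worked example above.
    static Set<Set<String>> sampleL2() {
        Set<Set<String>> l2 = new HashSet<>();
        for (String[] p : new String[][] { {"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
                                           {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"} })
            l2.add(new TreeSet<>(Arrays.asList(p)));
        return l2;
    }

    // True if every (k-1)-subset of 'candidate' appears in 'frequent'.
    static boolean allSubsetsFrequent(List<String> candidate, Set<Set<String>> frequent) {
        for (String skip : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(skip);                 // drop one item to form a (k-1)-subset
            if (!frequent.contains(subset)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // {I1, I2, I3} is kept; {I2, I3, I5} is pruned because {I3, I5} is missing from L2.
        System.out.println(allSubsetsFrequent(Arrays.asList("I1", "I2", "I3"), sampleL2()));
        System.out.println(allSubsetsFrequent(Arrays.asList("I2", "I3", "I5"), sampleL2()));
    }
}
```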

The algorithm then uses L3 Join L3 to generate the candidate 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus C4 = φ, and the algorithm terminates, having found all the frequent itemsets. This completes the Apriori algorithm.
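The 70% minimum confidence set earlier comes into play when rules are generated from the frequent itemsets. The following sketch evaluates rules drawn from the frequent itemset {I1, I2, I5}; the support counts are assumptions based on the classic 9-transaction textbook database consistent with the worked example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of rule generation: a rule X => Y drawn from a frequent itemset
// is accepted when count(X union Y) / count(X) meets the minimum confidence.
public class RuleGen {

    // Confidence of X => Y, given count(X union Y) and count(X).
    static double confidence(int countUnion, int countAntecedent) {
        return (double) countUnion / countAntecedent;
    }

    public static void main(String[] args) {
        // The frequent itemset {I1, I2, I5} has support count 2 in the assumed database.
        int countI1I2I5 = 2;
        // Each antecedent mapped to its assumed support count.
        Map<String, Integer> antecedents = new LinkedHashMap<>();
        antecedents.put("{I1, I2}", 4);
        antecedents.put("{I1, I5}", 2);
        antecedents.put("{I2, I5}", 2);
        antecedents.put("{I5}", 2);
        for (Map.Entry<String, Integer> e : antecedents.entrySet()) {
            double conf = confidence(countI1I2I5, e.getValue());
            System.out.printf("%s => rest of {I1, I2, I5}: confidence %.0f%% %s%n",
                    e.getKey(), 100 * conf, conf >= 0.7 ? "(accepted)" : "(rejected)");
        }
    }
}
```

Under these counts, {I1, I2} => {I5} has confidence 2/4 = 50% and is rejected, while the rules with antecedent counts of 2 reach 100% and clear the 70% threshold.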


Weka - Data Mining Software

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License. (Wikipedia)

The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, with a graphical user interface for easy access to this functionality. The original non-Java version of Weka was a TCL/TK front-end to (mostly third-party) modeling algorithms implemented in other programming languages, plus data preprocessing utilities in C and a Makefile-based system for running machine learning experiments. That original version was primarily designed as a tool for analysing data from agricultural domains, but the more recent, fully Java-based version (Weka 3), whose development began in 1997, is now used in many different application areas, particularly for education and research. Weka's main strengths are that it:

is freely available under the GNU General Public License,

is very portable, as it is fully implemented in the Java programming language and runs on almost any modern computing platform,

contains a comprehensive collection of data preprocessing and modeling techniques, and is very easy to use thanks to the graphical user interface it provides.


In 1993, the University of Waikato, New Zealand began developing the original version of Weka (which became a mix of TCL/TK, C, and Makefiles).

In 1997, the decision was made to redevelop Weka from scratch in Java, including implementations of the modeling algorithms.

In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award.

In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business intelligence, making it the data mining and predictive analysis component of the Pentaho business intelligence suite.

As of 2009-06-11, Weka's all-time rank on SourceForge was 246, with 1,566,318 downloads.




The clusterer uses an eager learning approach to build the K-means cluster algorithm. I developed the program in the Eclipse Java SDK environment. I named the class "k_meansCluster"; it consists of 7 functions, including main, which allows the program to be executed in a Java application environment.

Figure 11: K-Mean Source code screen

The function StartProcess() is the core function implementing the K-means algorithm. The steps involved in the program are as follows:

The core class has a few static and dynamic variable declarations, such as the K value (I have assigned K = 2), the input dataset row and column counts, and the input dataset filename.

The main() function creates a K_meansCluster() instance and initiates the process by calling the StartProcess() function.

The StartProcess() function starts the process by reading the input dataset [the ReadFile() function], then normalizes the dataset using the DataNormalize() function.

Once the dataset is normalized, the Calcuate() function starts; it calls another two functions, CentroidCalc() and Distance(), which perform the calculations explained in the example above.

Control then returns to the StartProcess() function, which displays all the attributes, Cluster0 and Cluster1.

The program also reports the number of iterations.
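The DataNormalize() step's internals are not shown above. One common choice for K-means input is min-max normalization, sketched below; this is an assumption about the approach, not the program's actual code.

```java
// Min-max normalization sketch: rescales every attribute (column) to [0, 1].
// This is an illustrative assumption about DataNormalize(), not its real code.
public class MinMaxNormalize {

    // Rescales each column of 'data' to [0, 1] in place.
    static void normalize(double[][] data) {
        int cols = data[0].length;
        for (int c = 0; c < cols; c++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[c]);
                max = Math.max(max, row[c]);
            }
            double range = max - min;
            if (range == 0) continue;                 // constant column: leave as-is
            for (double[] row : data) row[c] = (row[c] - min) / range;
        }
    }

    public static void main(String[] args) {
        // Two attributes on very different scales become comparable after rescaling.
        double[][] data = { {2, 100}, {4, 300}, {6, 200} };
        normalize(data);
        for (double[] row : data)
            System.out.println(row[0] + " " + row[1]);
    }
}
```

Rescaling matters for K-means because the distance calculation would otherwise be dominated by whichever attribute has the largest numeric range.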


Another eager and easy learning approach is the Apriori association algorithm. I developed this program in the Eclipse Java SDK environment as well. I named the class "My_Apriori"; it consists of 16 functions, including main, which allows the program to be executed in a Java application environment.

Figure 12: Apriori Source code Screen

The function run() is the core function implementing the Apriori algorithm. The steps involved in the program are as follows:

The core class has a few static variable and dataset declarations at the top, such as the minimum support, the minimum confidence and trans_set.

The My_Apriori() constructor builds the frequent and candidate sets when a new instance is created.

The main function creates a new instance of the My_Apriori class and calls the run() function.

run() calls the item1_gen() function to create the frequent set by iterating over the candidate set.

The candidate_set() and frequent_set() steps then repeat until the frequent set is empty.

Once the nodes are associated as candidates, the candidate and frequent sets of each step are displayed.

The program outputs the maximal frequent itemset and the association rules between the candidates.


K-Mean Result Comparison

Result using ionosphere dataset:

Figure 13: K-Mean Result

Figure 13 presents the results from my program and from Weka. The results are not identical but, judging from many online resources, they are close enough to be acceptable.

Apriori Result Comparison

Using the sample datasets:

Sample1: "abcdf","bcdfgh","acdef","cefg".

Figure 14: Apriori Result Sample 1

Sample 2: "134","235","1235","25"

Figure 15: Apriori Result Sample 2


The main goal of this dissertation was to study the development of the K-means and Apriori algorithms. Given another opportunity, I would develop these algorithms further and test them against small, medium and large datasets, comparing the results. When evaluating the performance of data mining techniques, in addition to predictive accuracy, some researchers have highlighted the explanatory nature of models and the need to reveal patterns that are valid, novel, useful and, perhaps more importantly, understandable and explainable (C. Matheus, P. Chan, and G. Piatetsky-Shapiro, 1993).

Lessons learnt:

Weka testing experience

K-Mean, Apriori Algorithm techniques and implementation concepts.

K-means clustering is one of the best candidate techniques for cluster computing.

Cluster computing allows the processing of huge datasets that cannot be handled on a single computer or manually.

Apriori is one of the best association rule algorithms and is easy to implement.

I wish to thank the University for setting this investigative report and for the opportunity to learn through this work. I am also thankful to Mr. Insu Song and Ms. Yeli Feng for motivating this work and contributing helpful comments.