Data mining techniques are classified according to the kind of database to be mined and the kind of knowledge to be discovered. The data mining techniques of classification, clustering, association rule mining and regression are discussed below.
Classification: The classification model can be rule-based, decision-tree based, association-rule based, Bayesian-network based, or neural-network based. The data records are partitioned into distinct segments called classes, and the end user or analyst must know beforehand how the classes are defined. Each record in the dataset used to build the classifier must already have a value for the attribute that defines the classes. Because every record carries such a value, and because the end user decides which attribute to use, classification is much less exploratory than clustering. The objective of a classifier is not to explore the data to discover interesting segments, but to decide how new records should be classified, that is, to assign examples to predefined categories. Machine learning software performs this task by extracting or learning discrimination rules from examples of correctly classified data. Classification assigns each data record to one of a predetermined set of classes, whose value serves as the label of the record, for example distinguishing elements that belong to the normal class from those that belong to the abnormal class. Classification models can be built using a wide variety of algorithms, which can be grouped into three types: extensions to linear discrimination (e.g., multilayer perceptron, logistic discrimination), decision tree and rule-based methods (e.g., C4.5, AQ, CART), and density estimators (e.g., Naïve Bayes, k-nearest neighbor, LVQ).
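As an illustration of how a classifier uses pre-labeled records to decide the class of a new record, the following is a minimal sketch of a k-nearest neighbor classifier, one of the algorithms named above. The records, the feature values and the function name `knn_classify` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

def knn_classify(training, new_record, k=3):
    """Classify new_record by majority vote among its k nearest labeled records.

    training: list of (feature_vector, class_label) pairs; every record
    already carries a value for the class attribute.
    """
    # Sort the labeled records by Euclidean distance to the new record.
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], new_record))
    # Majority vote among the k closest labeled records.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training records: (feature vector, class label).
training = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"),
    ((5.0, 5.0), "abnormal"),
    ((5.2, 4.9), "abnormal"),
]

print(knn_classify(training, (1.1, 1.0)))
print(knn_classify(training, (4.8, 5.1)))
```

The classes are fixed in advance by the labels in `training`; the function never discovers new segments, it only assigns a new record to one of the existing classes.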
Because the number of available network audit data instances is large, human labeling is time-consuming and expensive. Clustering is the process of labeling data and assigning it to groups. Clustering algorithms can group new data instances into similar groups, and these groups can be used to increase the performance of existing classifiers. High-quality clusters can also assist human experts with labeling. A cluster is 100% pure if it contains only data instances from one category. Clustering techniques can be categorized into two classes: pairwise clustering and central clustering. Pairwise clustering (i.e., similarity-based clustering) unifies similar data instances based on a pairwise distance measure. Central clustering, also called centroid-based or model-based clustering, models each cluster by its "centroid". In terms of runtime complexity, centroid-based clustering algorithms are more efficient than similarity-based clustering algorithms.
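The idea of centroid-based clustering and of cluster purity can be sketched as follows. This is a minimal k-means style illustration, not a specific algorithm taken from the text; the data points, the category labels and the choice of initial centroids are assumptions made for the example.

```python
import math
from collections import Counter

def k_means(points, centroids, iterations=10):
    """Centroid-based clustering: assign each point to its nearest centroid,
    then move each centroid to the mean of its members."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignment, centroids

def purity(assignment, labels, cluster):
    """Fraction of a cluster's members that belong to its most common category."""
    members = [lab for a, lab in zip(assignment, labels) if a == cluster]
    if not members:
        return 0.0
    return Counter(members).most_common(1)[0][1] / len(members)

# Hypothetical data instances with their true categories.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

# Initial centroids are an assumption: one point taken from each end of the data.
assignment, centroids = k_means(points, [points[0], points[3]])
print(assignment)
print(purity(assignment, labels, 0), purity(assignment, labels, 1))
```

Each iteration computes one distance per point and centroid, rather than one distance per pair of points, which is the source of the efficiency advantage over pairwise clustering.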
Association rule mining is designed specifically for the analysis of data. Each attribute/value pair is considered as an item by the association rule. The purpose of using association rules is to acquire attribute correlations from the database table. The data set is sifted through to find item sets that are likely to appear in the data. Association rule mining finds associations and/or correlation relationships among a large set of data items. Association rules show attribute value conditions that occur frequently together in a given dataset. Many association rule algorithms have been developed in the last decades, and they can be classified into two categories: (1) the candidate-generation-and-test approach, such as Apriori, and (2) the pattern-growth approach. The challenging issues of association rule algorithms are multiple scans of transaction databases and a large number of candidates. Apriori was the first scalable algorithm designed for association rule mining. The Apriori algorithm searches for large item sets during its initial database pass and uses the result as the basis for discovering other large item sets during subsequent passes.
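The candidate-generation-and-test approach can be sketched as follows. This is a simplified illustration of the Apriori idea under stated assumptions: the transactions and the minimum support count are hypothetical, and the sketch finds frequent item sets only, without deriving the rules themselves.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {item set: support count} for every item set that appears
    in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the transaction database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Pass 1: frequent single items.
    frequent = count({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-item sets into k-item sets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Test: count the surviving candidates against the database.
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result

# Hypothetical transactions; each item stands for one attribute/value pair.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
frequent_sets = apriori(transactions, min_support=3)
for items, support in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(items), support)
```

The sketch makes the two challenges mentioned above visible: the database is scanned once per level in `count`, and the number of candidates can grow quickly as `k` increases.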
3.2 Data Mining Tools
Data mining tools are used to predict trends and behaviors and to determine interesting patterns in the data. A number of flexible open-source data mining tools are available. Some of the most popular open-source data mining tools are WEKA, YALE, KNIME, Orange, GGobi, R and TANAGRA.
WEKA (Waikato Environment for Knowledge Analysis)
It is an open-source data mining environment which was first developed at the University of Waikato in New Zealand. Since its inception in 1992, WEKA has come to be regarded as a landmark in the field of data mining and machine learning. The WEKA environment is a collection of an impressive array of machine learning algorithms and data preprocessing tools. Over the years WEKA has become a widely used tool for data mining in research and academia.
For this project WEKA has been chosen over a number of other data mining tools because it can be used by programmers as well as by people with no programming background. Advanced programmers can access its components through Java or the Command Line Interface (CLI). For non-programmers, WEKA provides a rich GUI. It is also platform-independent.
3.3 WEKA User Interfaces
The WEKA GUI has four buttons, one for each major application, as shown in Figure 3.
Explorer: It is an environment in WEKA for data exploration. The Explorer environment has a panel-based interface. Each panel corresponds to a data mining task, as shown in Figure 4.
The first panel is the preprocess panel. In the preprocessing stage data preprocessing tools load and transform data. These preprocessing tools are called filters. The source of data can be files, URLs and databases. WEKA's own ARFF format, CSV, LibSVM's format, and C4.5's format are the supported file formats.
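To give an idea of what an ARFF file contains, the following minimal sketch writes a small table in ARFF layout: a relation name, one attribute declaration per column, and the data rows. The helper `to_arff`, the relation name and the sample values are hypothetical; the sketch does not quote names or values that contain spaces or commas, which a complete writer would need to handle.

```python
def to_arff(relation, attributes, rows):
    """Render a table as ARFF text.

    attributes: list of (name, kind) pairs, where kind is either the string
    'numeric' or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other built-in) attribute type.
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attribute: allowed values listed inside braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    # One comma-separated line per record.
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"

# Hypothetical two-attribute dataset.
arff_text = to_arff(
    "weather",
    [("temperature", "numeric"), ("play", ["yes", "no"])],
    [(85, "no"), (70, "yes")],
)
print(arff_text)
```

A file written this way can be opened from the preprocess panel like any other ARFF file, provided the names and values are simple enough not to need quoting.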
The next panel, called "Classify", gives access to WEKA's classification and regression algorithms; regression algorithms are viewed as predictors of "continuous classes". The data set prepared in the preprocessing stage can be used to estimate the performance of a particular algorithm by cross-validation.
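The principle of cross-validation can be sketched as follows: the data set is split into k folds, each fold is held out once for testing while the model is built on the remaining folds, and the results are averaged. This is a simplified illustration and not WEKA's own procedure; the fold assignment, the majority-class baseline learner and the sample data are assumptions made for the example.

```python
from collections import Counter

def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_records))
    for fold in range(k):
        test = indices[fold::k]  # every k-th record goes into this fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def cross_validate(records, labels, k, train_fn, predict_fn):
    """Return the fraction of records classified correctly across all folds."""
    correct = 0
    for train, test in k_fold_indices(len(records), k):
        model = train_fn([records[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_fn(model, records[i]) == labels[i] for i in test)
    return correct / len(records)

# Hypothetical baseline learner: always predict the most common training class.
def train_majority(records, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, record):
    return model

records = list(range(9))
labels = ["yes"] * 6 + ["no"] * 3
accuracy = cross_validate(records, labels, 3, train_majority, predict_majority)
print(round(accuracy, 3))
```

Every record is used for testing exactly once, so the estimate reflects performance on data the model has not seen during training.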
The third panel, "Cluster", supports unsupervised learning with clustering algorithms. These clustering algorithms are applied to the preprocessed data. The clustering performance is evaluated using the statistics provided.
The fourth panel is the Associate panel. It contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.
Figure 3 WEKA GUI Chooser
Figure 4 WEKA Explorer user interface
The next panel is "Select Attributes". This panel gives the user access to a number of algorithms and evaluation criteria for identifying the most important attributes in a dataset. This panel is designed for exploratory data analysis.
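One way to rank attributes by importance is information gain, which measures how much knowing an attribute's value reduces the uncertainty about the class. The following minimal sketch assumes information gain as the evaluation criterion purely for illustration; the attribute names and values are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on an attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    # Weighted average entropy of the class within each attribute value.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical dataset: two nominal attributes and a class label.
labels = ["yes", "yes", "no", "no"]
attributes = {
    "outlook": ["sunny", "sunny", "rainy", "rainy"],  # separates the classes
    "windy": ["true", "false", "true", "false"],      # carries no class information
}

# Rank attributes from most to least informative.
ranking = sorted(
    attributes,
    key=lambda name: information_gain(attributes[name], labels),
    reverse=True,
)
print(ranking)
```

In this example `outlook` determines the class completely, while `windy` is independent of it, so the ranking places `outlook` first.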
The last panel in the Explorer, called "Visualize", provides a color-coded scatter plot matrix, along with the option of drilling down by selecting individual plots in this matrix and selecting portions of the data to visualize. It is also possible to obtain information regarding individual data points, and to randomly perturb data by a chosen amount to uncover obscured data.
The second graphical user interface in WEKA is the "Experimenter". This interface is designed to facilitate experimental comparison of the predictive performance of algorithms based on the many different evaluation criteria that are available in WEKA. Experiments can involve multiple algorithms that are run across multiple datasets; for example, using repeated cross-validation. Experiments can also be distributed across different compute nodes in a network to reduce the computational load for individual nodes.
The third graphical user interface in WEKA is the "Knowledge Flow". This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
The fourth graphical user interface in WEKA is the "Simple CLI". This environment provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.
YALE (Yet Another Learning Environment)
It was first developed at the University of Dortmund and was later renamed RapidMiner. It can work on all operating systems, as it is written in Java. The tool comes with a GUI in which experiments are built from nestable operators and described in XML files.
KNIME (Konstanz Information Miner)
KNIME is a Java-based tool which runs inside the Eclipse platform, originally developed by IBM.
GGobi is designed for interactive, visualization-based mining of data. It can be used as a plug-in for other data mining tools.