The Use Of Data Mining Tools I E Weka Rosetta Accounting Essay



This chapter describes the research methodology used in this study. It covers the tools used in the research and the knowledge model development process. The research uses two data mining tools, WEKA and ROSETTA, which are discussed in more detail in this chapter, together with the Java language. The knowledge model development process consists of four phases: data collection, data preprocessing, mining process and evaluation. Each phase is discussed in more detail in the subsequent sections. The experimental results and the criteria used to compare the classification methods are also discussed.


Data mining tools make an important contribution to anomaly detection in control chart patterns. In this research, two data mining tools, WEKA and ROSETTA, are used to carry out the experiments. In addition, the Java language is used to write the program that preprocesses the data.

3.2.1 WEKA

WEKA (Waikato Environment for Knowledge Analysis) is a data mining system developed by the University of Waikato in New Zealand that implements data mining techniques in the Java language. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks, and the algorithms can be applied directly to a dataset. WEKA implements algorithms for data preprocessing, classification, regression, clustering and association rules; it also includes visualization tools. New machine learning schemes can also be developed with this package. WEKA is open source software issued under the GNU General Public License (Witten et al. 1999; Witten & Frank 2005).


3.2.2 ROSETTA

ROSETTA is a toolkit for analyzing tabular data within the framework of rough set theory. It is designed to support the overall data mining and knowledge discovery process: from initial browsing and preprocessing of the data, via computation of minimal attribute sets and generation of if-then rules or descriptive patterns, to validation and analysis of the induced rules or patterns (David L. Olson & Dursun Delen 2008).

3.2.3 Java

Java is an object-oriented programming language developed by James Gosling at Sun Microsystems in the early 1990s. Its syntax is very similar to that of C and C++, but it has a simpler object model and fewer low-level facilities. Java is currently one of the most popular programming languages in use.

The Java programming language is a general-purpose, concurrent, class-based, object-oriented language, specifically designed to have as few implementation dependencies as possible. It allows application developers to write a program once and then run it on any platform that supports Java (James Gosling et al. 2000).


The knowledge model is developed in four phases (data collection, data preprocessing, mining process and evaluation), each of which is an important part of the development process. Each phase is discussed in more detail in the subsequent sections. Figure 3.1 shows the flow of development of the anomaly detection and recognition of control chart patterns using classification techniques.

[Figure 3.1: flowchart of the knowledge model development process. Data collection: Data 1, the Synthetic Control Chart Time Series data set (SCCTS). Data preprocessing: entropy method, PCA method, the combination of SAX and PAA. Mining process: convert to ARFF format; split dataset into training and testing; mining classifiers (decision tree, RBF networks, SVM, JRip algorithm and Single Conjunctive Rule Learner).]

Figure 3.1 The flow of development of the anomaly detection and recognition of control chart patterns using classification techniques.

3.3.1 Data collection

This research uses the Synthetic Control Chart Time Series data set (SCCTS) from UCI KDD as the test time series dataset. The dataset consists of 600 control chart samples with 60 attributes each. The 600 samples are divided into six classes (normal, cyclic, increasing trend, decreasing trend, upward shift, downward shift). The six classes (patterns) were generated according to the six equations given in (Pham and Chan 1998; Pham and Oztemel 1994). An excerpt of the data is shown in Table 3.1.

Table 3.1: The Synthetic Control Chart Time Series data set (SCCTS); rows are samples 1-600, columns are time stages

No.   t1        t2        t3        ...   t60
1     28.7812   34.4632   31.3381   ...   25.8717
2     24.8923   25.741    27.5532   ...   26.691
3     31.3987   30.6316   26.3983   ...   29.343
4     25.774    30.5262   35.4209   ...   25.3069
5     27.1798   29.2498   33.6928   ...   31.0179
6     25.5067   29.7929   28.0765   ...   35.4907
7     28.6989   29.2101   30.9291   ...   26.4637
8     30.9493   34.317    35.5674   ...   34.523
9     35.2538   34.6402   35.7584   ...   32.3833
...
595   31.0216   28.1397   26.7303   ...   15.366
596   29.6254   25.5034   31.5978   ...   24.1289
597   27.4144   25.3973   26.46     ...   10.7201
598   35.899    26.6719   34.1911   ...   17.4747
599   24.5383   24.2802   28.2814   ...   17.4599
600   34.3354   30.9375   31.9529   ...   10.1521

Each of the six classes consists of 100 time series, and each time series has a length of 60, that is, 60 numerical attributes. The six classes are shown in Table 3.2.

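Table 3.2: The six classes of control chart

Class               Time Series
Normal              [plot not reproduced]
Cyclic              [plot not reproduced]
Increasing trend    [plot not reproduced]
Decreasing trend    [plot not reproduced]
Upward shift        [plot not reproduced]
Downward shift      [plot not reproduced]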







A control chart consists of points representing a statistic (such as a mean, range or proportion) of measurements of a quality characteristic in samples taken from the manufacturing process at different times. A time series is a sequence of points, typically measured at successive times spaced at uniform intervals, and such series often arise when monitoring industrial processes. Time series analysis therefore comprises methods and techniques for analyzing time series data in order to extract meaningful statistics and other characteristics of the data.

Control chart patterns are classified as normal or abnormal. Normal patterns always exist in the manufacturing process, regardless of how the product is designed and how adequately the process is maintained, as mentioned in previous chapters. The unnatural control chart patterns (CCP) illustrated in this research are of the three types listed below, and numerous quality practitioners have ascribed them to the following assignable causes (Cheng 1997):

Trend patterns: defined as a continuous movement in either the positive or the negative direction. Possible causes of this type of pattern are tool wear, operator fatigue, equipment deterioration, and so on.

Shift patterns: defined as a sudden change above or below the average of the process. This change may have a number of causes, such as an alteration in process settings, replacement of raw materials, minor failure of machine parts, or introduction of new workers.

Cyclic patterns: cyclic behavior can be observed as a series of peaks and troughs occurring in the process. Typical causes of this pattern are the periodic rotation of operators, systematic environmental changes or fluctuation in the production equipment.
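To make the normal, trend, shift and cyclic patterns concrete, the following minimal Python sketch generates one series of each kind. It is an illustration only: the function name `make_pattern` and every numeric parameter (process mean, noise range, slope, shift size, cycle amplitude and period) are assumptions chosen for readability, not the generating equations of Pham and Oztemel or the values behind the SCCTS data.

```python
import math
import random


def make_pattern(kind, n=60, mean=30.0, sigma=2.0, rng=None):
    """Return one illustrative control chart series of length n.

    Every numeric setting below is an assumed value for illustration.
    """
    rng = rng or random.Random(0)
    gradient = 0.5      # assumed trend slope per time step
    shift = 15.0        # assumed size of the sudden shift
    shift_at = n // 2   # assumed time step at which the shift happens
    amplitude = 12.0    # assumed cycle amplitude
    period = 12.0       # assumed cycle period
    series = []
    for t in range(1, n + 1):
        noise = rng.uniform(-3.0, 3.0) * sigma  # bounded random variation
        if kind == "normal":
            extra = 0.0
        elif kind == "cyclic":
            extra = amplitude * math.sin(2 * math.pi * t / period)
        elif kind == "increasing trend":
            extra = gradient * t
        elif kind == "decreasing trend":
            extra = -gradient * t
        elif kind == "upward shift":
            extra = shift if t >= shift_at else 0.0
        elif kind == "downward shift":
            extra = -shift if t >= shift_at else 0.0
        else:
            raise ValueError(f"unknown pattern: {kind}")
        series.append(mean + noise + extra)
    return series
```

Calling `make_pattern("upward shift")` returns 60 values whose second half sits clearly above the first half, which is the sudden change that a shift pattern describes.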

3.3.2 Pre-processing data

Data are normally preprocessed through data cleaning, data integration, data selection and data transformation, and are thereby prepared for the mining task in Knowledge Discovery in Databases (KDD), as shown in Figure 3.2. Advances in statistical methods and machine learning techniques have played important roles in analyzing high dimensional data sets to discover the patterns hidden in them. However, the very high dimensionality of such datasets still makes mining a nontrivial task. Attribute reduction (dimensionality reduction) is therefore an essential preprocessing task for such data analysis: it removes noisy, irrelevant or misleading features to produce a minimal feature subset.


[Figure 3.2: diagram of the KDD stages; the recoverable box labels are Initial Data, Target Data, Preprocessed Data, Transformed Data and Data mining.]

Figure 3.2 The KDD stages (Duhman 2003)

In this section, the data preprocessing of the control chart time series dataset is discussed. It involves four main steps: discretization of the raw data using the entropy method, measurement of similarity using the PCA method, and transformation of the data using PAA followed by SAX.

Entropy-Based Discretization

In this study, the first step is to compute the degree of dispersal of the time series data using the entropy measure. Entropy is used to discretize the time series data into a formalized form that can be analyzed by PCA. Entropy is a commonly used measure in information theory, where it characterizes the impurity of an arbitrary collection of examples. The measure is based on the pioneering work of Claude Shannon on information theory, which studied the value or "information content" of messages.

Entropy is a function that measures the variability of a random variable or, more specifically for our case, of the values of an attribute across instances. The higher the entropy of an attribute, the more evenly its values are distributed across all instances. Entropy-based discretization methods are sensitive to changes in the distribution even when the majority class does not change (Witten & Frank 2005). Entropy gives the information required in bits (which can involve fractions of bits). For a two-class collection, entropy ranges from 0, when all instances belong to the same class, to its maximum of 1, when the collection contains an equal number of positive and negative examples; if the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1 (Andrew W. Moore 2003). More generally, a target attribute with n equally likely values has the maximum entropy of log2 n bits. We measure the entropy of a dataset X as shown in Equation (3.1):

$$\mathrm{Entropy}(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{3.1}$$

where $p_i$ is the proportion of instances in the dataset that take the ith value of the target attribute, which has n different values. This probability measure gives an indication of how uncertain we are about the data.
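The entropy measure can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study; the function name `entropy` and the example labels are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum -p * log2(p) over the observed values of the target attribute.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())
```

A collection with a single class gives 0 bits, a balanced two-class collection gives 1 bit, and six equally frequent classes (as in the SCCTS data) give log2 6, about 2.58 bits.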

To see the effect of splitting the dataset on a particular attribute, we can use a measure called information gain, which calculates the reduction in entropy (the gain in information) that would result from splitting the data on an attribute A. It is simply the expected reduction in entropy caused by partitioning the set of observations X on attribute A, as shown in Equation (3.2):

$$\mathrm{Gain}(X, A) = \mathrm{Entropy}(X) - \sum_{v \in \mathrm{Values}(A)} \frac{|X_v|}{|X|}\,\mathrm{Entropy}(X_v) \tag{3.2}$$

where v is a value of A, $X_v$ is the subset of instances of X where A takes the value v, and $|X|$ is the number of instances. Note that the first term in the equation for Gain is just the entropy of the original collection X, and the second term is the expected value of the entropy after X is partitioned using attribute A. The value of Gain(X, A) is the number of bits saved when encoding the target value of an arbitrary member of X, by knowing the value of attribute A (Lin & Johnson 2002).

PCA method

The preprocessing phase in similarity search aims to deal with several commonly occurring distortions in raw data, namely offset translation, amplitude scaling, noise, and time warping. Because of the tremendous size and high dimensionality of time series data, data reduction often serves as the first step in time series analysis. Data reduction leads not only to much smaller storage space but also to much faster processing. A time series can be viewed as data of very high dimensionality, where each point in time is a dimension; dimensionality reduction is therefore our major concern here (Han & Kamber 2001).

The next step is to compute the similarity using PCA. PCA is commonly used for time series analysis. It is a dimensionality-reduction technique that returns a compact representation of a multidimensional dataset by reducing the data to a lower-dimensional subspace. PCA can explain the measurement data via a set of linear functions that model the combinational relationship between measurement variables and latent variables (Chen & Lin 2001). As Dash et al. (2010) state, "Principal Component Analysis is an unsupervised linear feature reduction method, for projecting high dimensional data in to a low dimensional space with minimum reconstruction error".

The main objective of principal component analysis is to identify the most meaningful basis to re-express a data set; the hope is that this new basis will filter out the noise and reveal hidden structure (Shlens 2009). PCA is applied to a multivariate data set, which can be represented as an n × p matrix X. In the case of time series, n represents their length (the number of time instances), whereas p is the number of variables being measured (the number of time series). Each row of X can be considered as a point in p-dimensional space. The objective of PCA is to determine a new set of orthogonal and uncorrelated composite variates $Y_j$, which are called principal components:

$$Y_j = a_{1j}x_1 + a_{2j}x_2 + \dots + a_{pj}x_p, \quad j = 1, 2, \dots, p$$

The coefficients $a_{ij}$ are called component weights and $x_i$ denotes the ith variable. Each principal component is a linear combination of the original variables and is derived in such a manner that each successive component accounts for a smaller portion of the variation in X. Therefore, the first principal component accounts for the largest portion of variance, the second one for the largest portion of the remaining variance, subject to being orthogonal to the first one, and so on. Hopefully, the first m components will retain most of the variation present in all of the original variables (p). Thus, an essential dimensionality reduction may be achieved by projecting the original data onto the new m-dimensional space, as long as m << p (Karamitopoulos et al. 2010).

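In this section, we illustrate how PCA is applied to m-dimensional time series data of length n. The derivation of the new axes (components) is based on Σ, the covariance matrix of X, which is calculated using the following equation (according to Tanaka et al. 2005):

$$\Sigma = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \bar{x})(x_k - \bar{x})^{\mathsf{T}}$$

where $x_k = (x_{k1}, \dots, x_{km})$ is the m-dimensional observation at time k and $\bar{x} = (\bar{x}_1, \dots, \bar{x}_m)$ is the mean vector. The eigenvalues of Σ are ordered as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$, and the eigenvector belonging to $\lambda_i$ is represented as $e_i = [e_{i1}, e_{i2}, \dots, e_{im}]$. Then, the ith principal component at time k is calculated by means of

$$pc_i(k) = e_{i1}(x_{k1} - \bar{x}_1) + e_{i2}(x_{k2} - \bar{x}_2) + \dots + e_{im}(x_{km} - \bar{x}_m)$$

In our approach, we use the first principal component to transform the multidimensional time series data into 1-dimensional time series data. Finally, we obtain the 1-dimensional time series data $T = (T_1, \dots, T_n)$ as follows:

$$T_k = e_{11}(x_{k1} - \bar{x}_1) + e_{12}(x_{k2} - \bar{x}_2) + \dots + e_{1m}(x_{km} - \bar{x}_m) \tag{3.8}$$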
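The projection onto the first principal component can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function name `first_principal_component` is hypothetical, and power iteration is swapped in for a full eigen-decomposition so that the example needs no external library. Power iteration can fail to converge to the dominant axis when the starting vector happens to be orthogonal to it, so a production implementation would use a proper eigen-solver.

```python
def first_principal_component(rows, iterations=200):
    """Return (axis, scores): the first principal axis of the data and the
    projection of every mean-centered row onto that axis."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix, using the 1/(n-1) factor.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector tends to the dominant eigenvector.
    axis = [1.0 + j for j in range(m)]  # assumed arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * axis[b] for b in range(m)) for a in range(m)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0.0:
            break
        axis = [v / norm for v in nxt]
    # One score per time step: the 1-dimensional series.
    scores = [sum(axis[j] * c[j] for j in range(m)) for c in centered]
    return axis, scores
```

For points lying on the line y = 2x, for example `[(1, 2), (2, 4), (3, 6), (4, 8)]`, the returned axis is proportional to (1, 2) and the scores are the positions of the points along that line.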

PCA dynamically detects the significant coordinates that contain the characteristic patterns of the original data Tm, because the significance of each coordinate is represented by its coefficient. In addition, the first principal component retains the largest amount of information of the original data (Heras et al. 1996). The first principal component is thus a linear combination of the original variables, weighted according to their contribution to the original data.

The combination of SAX and PAA

In this study, we employ Piecewise Aggregate Approximation (PAA) and Symbolic Aggregate Approximation (SAX) as data representations. The basic idea of the PAA representation is to divide the time series data into segments and take the average value of each segment, which yields a vector. It is a dimensionality-reduction representation method: a time series $T = t_1, \dots, t_n$ of length n can be represented by a vector $\bar{T} = \bar{t}_1, \dots, \bar{t}_w$, where w is the number of PAA segments representing the time series. The average value $\bar{t}_i$ of the ith segment is calculated by Equation (3.9):

$$\bar{t}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} t_j \tag{3.9}$$

Put simply, to reduce the dimensionality from n to w, we first divide the original time series into w equally sized frames and then compute the mean value of each frame; the vector of these values is the data-reduced representation. The sequence assembled from the mean values is the PAA transform of the original time series, as shown in Figure 3.3, where a sequence of length 128 is reduced to 8 dimensions. Following Keogh & Kasetty (2002), we normalize each time series to have a mean of zero and a standard deviation of one, since it is well understood that it is meaningless to compare time series with different offsets and amplitudes.
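The normalization and the frame averaging described above can be sketched as follows. The function names `z_normalize` and `paa` are hypothetical, the use of the population standard deviation is an assumption, and the sketch only handles the case where n is divisible by w.

```python
from statistics import mean, pstdev


def z_normalize(series):
    """Rescale a series to mean 0 and standard deviation 1."""
    mu, sd = mean(series), pstdev(series)  # population std is an assumption
    if sd == 0:
        return [0.0 for _ in series]       # constant series: nothing to scale
    return [(x - mu) / sd for x in series]


def paa(series, w):
    """Reduce a series of length n to w frame means."""
    n = len(series)
    if w <= 0 or n % w != 0:
        raise ValueError("this sketch assumes n is a positive multiple of w")
    size = n // w
    return [mean(series[i * size:(i + 1) * size]) for i in range(w)]
```

For example, `paa([1, 2, 3, 4, 5, 6, 7, 8], 4)` averages consecutive pairs and returns `[1.5, 3.5, 5.5, 7.5]`; a series of length 60 from the SCCTS data could be reduced in the same way with any w that divides 60.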

The PAA representation of each time series is the vector $\bar{T}$. 'Break points' are then determined to transform this w-dimensional vector into a sequence of 'SAX symbols'. The break points divide the PAA representation into equiprobable regions under a Gaussian distribution (Lin et al., 2002, 2003). SAX allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w < n, typically w << n). The alphabet size is also an arbitrary integer a, where a > 2.

According to Lin et al. (2003), breakpoints are defined as follows:

Definition 1 "Breakpoints are a sorted list of numbers $B = \beta_1, \dots, \beta_{a-1}$ such that the area under a Gaussian curve from $\beta_i$ to $\beta_{i+1}$ = 1/a ($\beta_0$ and $\beta_a$ are defined as -∞ and ∞, respectively)."

The PAA-transformed time series data are then passed to the SAX algorithm to obtain a discrete symbolic representation. Since normalized time series have a Gaussian distribution, we can determine the "breakpoints" that produce equal-sized areas under the Gaussian curve. These breakpoints may be determined by looking them up in a statistical table. Table 3.3 gives the breakpoints for values of a from 3 to 10.

Table 3.3: SAX breakpoints under a Gaussian distribution for alphabet sizes a = 3 to 10

       a=3     a=4     a=5     a=6     a=7     a=8     a=9     a=10
β1    -0.43   -0.67   -0.84   -0.97   -1.07   -1.15   -1.22   -1.28
β2     0.43    0      -0.25   -0.43   -0.57   -0.67   -0.76   -0.84
β3             0.67    0.25    0      -0.18   -0.32   -0.43   -0.52
β4                     0.84    0.43    0.18    0      -0.14   -0.25
β5                             0.97    0.57    0.32    0.14    0
β6                                     1.07    0.67    0.43    0.25
β7                                             1.15    0.76    0.52
β8                                                     1.22    0.84
β9                                                             1.28
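Instead of a table lookup, the breakpoints can be computed from the inverse cumulative distribution function of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name `sax_breakpoints` is hypothetical.

```python
from statistics import NormalDist


def sax_breakpoints(a):
    """Return the a-1 breakpoints that cut a standard normal
    distribution into a regions of equal probability 1/a."""
    if a < 2:
        raise ValueError("alphabet size must be at least 2")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(i / a) for i in range(1, a)]
```

Rounded to two decimals, `sax_breakpoints(5)` gives -0.84, -0.25, 0.25 and 0.84, the thresholds used for a five-symbol alphabet.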


According to Lin et al. (2003), once the breakpoints have been obtained we can discretize a time series in the following manner. We first obtain a PAA of the time series. All PAA coefficients that are below the smallest breakpoint are mapped to the symbol "a", all coefficients greater than or equal to the smallest breakpoint and less than the second smallest breakpoint are mapped to the symbol "b", and so on. Figure 3.3 illustrates the idea. In this figure, note that the three symbols "a", "b" and "c" are approximately equiprobable, as desired. We call the concatenation of symbols that represent a subsequence a word. According to Lin et al. (2003), a word is defined as follows:

Definition 2 "Word: A subsequence C of length n can be represented as a word $\hat{C} = \hat{c}_1, \dots, \hat{c}_w$ as follows. Let $\alpha_i$ denote the ith element of the alphabet, i.e., $\alpha_1$ = a and $\alpha_2$ = b". Then the mapping from a PAA approximation $\bar{C}$ to a word $\hat{C}$ is obtained as follows:

$$\hat{c}_i = \alpha_j \iff \beta_{j-1} \le \bar{c}_i < \beta_j$$

We have now defined our symbolic representation (the PAA representation is merely an intermediate step required to obtain the symbolic representation) (Lin et al. 2003).
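The mapping from PAA coefficients to a word can be sketched with a binary search over the sorted breakpoints. The function name `sax_word` and the sample PAA values are hypothetical; the breakpoints -0.43 and 0.43 are the a = 3 values from Table 3.3.

```python
from bisect import bisect_right
import string


def sax_word(paa_values, breakpoints):
    """Map each PAA coefficient to a letter: values below the first
    breakpoint become 'a', values in the next region 'b', and so on."""
    letters = string.ascii_lowercase
    # bisect_right counts the breakpoints that are <= the value, so a value
    # equal to a breakpoint falls into the region above that breakpoint.
    return "".join(letters[bisect_right(breakpoints, v)] for v in paa_values)
```

With made-up coefficients `[0.0, -1.0, -0.5, 0.1, 1.2, 0.9, 0.0, 0.7]` and the a = 3 breakpoints, the function returns an eight-letter word over the alphabet a, b, c, in the same way that Figure 3.3 maps a series with w = 8 to a word.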


[Figure 3.3: plot of a time series with its PAA approximation; the recoverable labels are Break point 1 and Break point 2.]

Figure 3.3 A time series is discretized by first obtaining a PAA approximation and then using predetermined breakpoints to map the PAA coefficients into SAX symbols. In the example above, with n = 128, w = 8 and a = 3, the time series is mapped to the word baabccbc (Lin et al. 2003).

INPUT: Raw_data(object, attribute)
OUTPUT: SAX(object, attribute)

/* Step 1: calculate entropy for each selected attribute */
for attribute j = 1, ..., n {
    for object i = 1, ..., m-1 {
        if value(i, j) == value(i+1, j) {
            num_object = num_object + 1    /* collect all objects with similar value */
        }
    }
    total_entropy = entropy(num_object, total_object)
}

/* Step 2: calculate PCA for Entropy(object, attribute) */
data = matrix(M, N)
for attribute j = 1, ..., n {
    for object i = 1, ..., m {
        mn = mean for attribute j
        data = data - mn                   /* subtract off the mean for each dimension */
    }
}
Covariance = 1/(N-1) * data * data'        /* calculate the covariance matrix */
[PC, V] = eigenvalue(Covariance)           /* find the eigenvectors and eigenvalues */
V = diagonal(V)                            /* extract diagonal of matrix as vector */
[junk, rindices] = sort(-1 * V)            /* sort the variances in decreasing order */
V = V(rindices)
PC = PC(:, rindices)
signals = PC' * data                       /* project the original data set */
PCA(object, attribute) = signals

/* Step 3: calculate PAA for PCA(object, attribute) */
data = matrix(M, N)
for attribute j = 1, ..., n {
    for object i = 1, ..., m {
        Z = (data - mean(data)) / std(data)    /* normalize to mean 0 and standard deviation 1 */
        PAA(object, attribute) = Z
    }
}

/* Step 4: calculate SAX for PAA(object, attribute) */
data = matrix(M, N)
for object i = 1, ..., m {
    for attribute j = 1, ..., n {
        value < -0.84            change to 1
        -0.84 ≤ value < -0.25    change to 2
        -0.25 ≤ value < 0.25     change to 3
        0.25 ≤ value < 0.84      change to 4
        0.84 ≤ value             change to 5
    }
}

ALGORITHM 1: The Time Series Control Chart Data Preprocessing Algorithm

3.3.3 Mining process

The mining process aims to obtain the best knowledge model for anomaly detection in control chart patterns using classification. As shown in Figure 3.1, the mining process phase consists of three steps: converting the data to ARFF format, splitting the dataset into training and testing sets, and mining with classifiers. The control chart dataset was prepared and preprocessed before it was split into training and testing datasets. The dataset was converted to ARFF format because the WEKA mining tool supports it. The features extracted from the dataset are mined using five popular mining techniques: decision tree, support vector machine, RBF networks, the JRip algorithm and the Single Conjunctive Rule Learner. The resulting knowledge models are then compared on detection accuracy, average error and the time taken to build the model.

Split data into training and testing set

To obtain a good predictive model, the data must be split into training and testing sets. According to Han & Kamber (2001), data splitting is an important technique for estimating the accuracy of the developed classifier. The splitting process ensures that all the data are used for training and testing. Good methods for splitting the data include holdout (percentage split) and k-fold cross validation.

The method used in this research is k-fold cross validation, which is suitable for various sizes of sample data. Hertzamann (2007) mentioned that this method makes efficient use of the data. The method produces k different folds, and each fold is divided into k-1 models. Each model has different training and testing data.

In this research, 10 random folds, generated using Excel, are used for anomaly detection and recognition in control chart patterns. Each fold consists of 9 models, which result from splitting the data into (training set : testing set) ratios of 90:10, 80:20, 70:30, 60:40 and so on. The training dataset is used to develop the model and the testing dataset is used to test the detection accuracy and error rate of the model. Figure 3.4 shows the division of the data. The data are then mined using classification algorithms.
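The splitting scheme described above (10 random orderings, each cut at nine training:testing ratios) can be sketched as follows. This is one reading of the procedure, not the Excel or ROSETTA workflow used in the study; the function name `ratio_splits` and the seed are assumptions. Note that it differs from textbook k-fold cross validation, in which the data are cut into k equal parts and each part serves once as the test set.

```python
import random


def ratio_splits(n_samples, n_shuffles=10, seed=0):
    """Return a list of (shuffle_id, train_percent, train_idx, test_idx)."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    plans = []
    for shuffle_id in range(n_shuffles):
        order = list(range(n_samples))
        rng.shuffle(order)             # one random ordering per fold
        for train_pct in range(90, 0, -10):   # 90, 80, ..., 10
            cut = n_samples * train_pct // 100
            plans.append((shuffle_id, train_pct, order[:cut], order[cut:]))
    return plans
```

For the 600 SCCTS samples, `ratio_splits(600)` yields 90 plans (10 orderings times 9 ratios); the first plan has 540 training indices and 60 testing indices.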

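[Figure 3.4: diagram of the cross-validation process; the recoverable box labels are Cleaned data, Training & Testing data, and Accuracy detection of anomaly in control chart patterns.]

Figure 3.4 The 10-fold cross-validation process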

The ROSETTA tool is used to split the data into training and testing sets. ROSETTA automates the training and testing process and makes it easy to divide the data randomly into training and testing sets in the chosen ratios.

Mining classifier

Machine learning techniques have recently been applied extensively to control chart classification. Machine learning covers such a broad range of processes that it is difficult to define precisely. A dictionary definition of learning includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience" and "modification of a behavioral tendency by experience"; zoologists and psychologists study learning in animals and humans (Nilsson 1999).

Machine learning techniques can be used in a variety of data mining tasks, such as classification and clustering. Classification is probably the most popular machine learning problem. In a classification task, the goal is to take each instance of a dataset and assign it to a particular class. According to Osmar (1999), classification analysis is the organization of data in given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set in which all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, which is then used to classify new objects. Five popular classification algorithms are used in this research to process the control chart data: decision tree, support vector machine, RBF networks, the JRip algorithm and the Single Conjunctive Rule Learner. Each technique is discussed in more detail in this section.

Decision Tree

Decision tree learning is one of the most successful learning algorithms, owing to its attractive features: simplicity, comprehensibility, lack of parameters, and the ability to handle mixed-type data. In decision tree learning, a decision tree is induced from a set of labeled training instances, each represented by a tuple of attribute values and a class label. Because of the vast search space, decision tree learning is typically a greedy, top-down and recursive process that starts with the entire training data and an empty tree. Decision trees are, among other things, easy to visualize and understand and resistant to noise in data (Witten & Frank 2005). Commonly, decision trees are used to classify records into the proper class. Moreover, they are applicable to both regression and association tasks. An attribute that best partitions the training data is chosen as the splitting attribute for the root, and the training data are then partitioned into disjoint subsets according to the values of the splitting attribute (Su & Zhang 2006).
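The choice of a splitting attribute for the root can be sketched as follows. The sketch assumes information gain (Equation 3.2) as the partition criterion, which is one common choice and not necessarily the one used by the WEKA tree learner in this study; the function names and the toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning on attribute attr."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)      # one subset per attribute value
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split_attribute(rows, labels):
    """Index of the attribute with the highest information gain."""
    return max(range(len(rows[0])),
               key=lambda a: information_gain(rows, labels, a))
```

With rows `[("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]` and labels `["up", "up", "down", "down"]`, the first attribute separates the classes perfectly (gain of 1 bit) while the second gives no gain, so attribute 0 is chosen for the root.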

Support Vector Machine

The support vector machine (SVM) is considered one of the most successful learning algorithms proposed for classification, regression and novelty detection tasks. According to Chang & Chang (2007), SVM is a powerful machine learning tool that is capable of representing non-linear relationships and producing models that generalize well to unseen data. The basic concept of an SVM is to transform the data into a higher dimensional space and find the optimal hyperplane in that space, which maximizes the margin between classes. Applying SVM to a classification problem involves two steps. First, SVM transforms the input space into a higher dimensional feature space through a non-linear mapping function. Second, it constructs the separating hyperplane with the maximal margin of separation, which improves the generalization ability of the resulting classifier (Burges 1998).

Radial Basis Function

Radial basis function (RBF) networks have a static Gaussian function as the nonlinearity for the hidden layer processing elements. The Gaussian function responds only to a small region of the input space where the Gaussian is centered (Buhmann 2003). The key to a successful implementation of these networks is to find suitable centers for the Gaussian functions (Chakravarthy and Ghosh 1994; Howell, A.J. and Buxton H. 2002). The simulation starts with the training of an unsupervised layer. Its function is to derive the Gaussian centers and the widths from the input data. These centers are encoded within the weights of the unsupervised layer using competitive learning (Howell, A.J. and Buxton H. 2002). During the unsupervised learning, the widths of the Gaussians are computed based on the centers of their neighbors. The output of this layer is derived from the input data weighted by a Gaussian mixture. The advantage of the radial basis function network is that it finds the input to output map using local approximators. Usually the supervised segment is simply a linear combination of the approximators. Since linear combiners have few weights, these networks train extremely fast and require fewer training samples.
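The forward pass of such a network can be sketched as follows: each hidden unit responds with a Gaussian of the distance between the input and its center, and the output is a linear combination of those responses. The function names, and the idea of passing centers, widths and weights in directly instead of learning them, are assumptions made for illustration.

```python
import math


def gaussian_rbf(x, center, width):
    """Response of one hidden unit: 1 at the center, decaying with distance."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_network_output(x, centers, widths, weights, bias=0.0):
    """Linear combination of the Gaussian hidden-unit responses."""
    hidden = [gaussian_rbf(x, c, s) for c, s in zip(centers, widths)]
    return bias + sum(w * h for w, h in zip(weights, hidden))
```

Because each Gaussian responds only near its own center, an input close to one center is dominated by that unit's weight, which is the local approximation property mentioned above.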

JRip (Extended Repeated Incremental Pruning)

JRip implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER). It was proposed by William W. Cohen (1995) as an optimized version of IREP. RIPPER builds a ruleset by repeatedly adding rules to an empty ruleset until all positive examples are covered. Rules are formed by greedily adding conditions to the antecedent of a rule (starting with an empty antecedent) until no negative examples are covered. After a ruleset is constructed, an optimization postpass massages the ruleset so as to reduce its size and improve its fit to the training data. A combination of cross-validation and minimum-description-length techniques is used to prevent overfitting.

Single Conjunctive Rule Learner

The single conjunctive rule learner is a machine learning algorithm that belongs to inductive learning. The objective of rule induction is generally to induce from data a set of rules that captures all generalizable knowledge within that data while being as small as possible (Cohen 1995). Classification in rule-induction classifiers is typically based on the firing of a rule on a test instance, triggered by matching feature values on the left-hand side of the rule (Clark & Niblett 1989). Rules can be of various normal forms and are typically ordered; with ordered rules, the first rule that fires determines the classification outcome and halts the classification process.
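How an ordered list of conjunctive rules classifies an instance can be sketched as follows. The sketch covers only rule firing, not rule induction; the function name, the attribute names `s1` and `s2`, and the example rules are hypothetical.

```python
def classify_with_rules(instance, rules, default):
    """Return the label of the first rule whose conditions all match.

    instance: dict mapping attribute name to value
    rules: ordered list of (conditions, label); conditions is a dict that
           must match the instance on every listed attribute (a conjunction)
    default: label returned when no rule fires
    """
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return label          # first firing rule halts the process
    return default
```

With the rules `[({"s1": "a", "s2": "a"}, "downward shift"), ({"s1": "a"}, "decreasing trend")]`, an instance with `s1 = "a"` and `s2 = "a"` is labelled by the first rule even though the second rule also matches, because rule order decides the outcome.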

3.3.4 Evaluation

The evaluation phase gauges and investigates the performance of the selected classification methods, namely decision tree, support vector machine, RBF networks, the JRip algorithm and the Single Conjunctive Rule Learner. The performance of each knowledge model is evaluated on detection accuracy and average error. The robustness of the model is evaluated by performing the 10-fold cross validation process. The model with the highest detection accuracy and the lowest error rate is chosen as the best knowledge model.
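The comparison step can be sketched as follows, assuming that detection accuracy is the fraction of correctly classified test instances and that models are ranked by their average accuracy over the splits. Both function names, the model names and the numbers in the usage note are hypothetical, and the sketch does not reproduce the error measures reported by WEKA.

```python
def accuracy(actual, predicted):
    """Fraction of test instances whose predicted class equals the actual one."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def pick_best(results):
    """results maps a model name to its list of per-split accuracies.

    Returns (best model name, dict of average accuracies)."""
    averages = {name: sum(vals) / len(vals) for name, vals in results.items()}
    best = max(averages, key=averages.get)
    return best, averages
```

For instance, `pick_best({"model A": [0.90, 0.80], "model B": [0.95, 0.90]})` selects "model B"; the error rate of each model is simply one minus its accuracy under this assumption.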

3.3.5 Conclusion

This chapter introduced the methodology for anomaly detection and recognition in control chart patterns. The steps can be followed in Figure 3.1. The data used in this research, the Synthetic Control Chart Time Series data set (SCCTS) from UCI KDD, was presented. For data preprocessing, this chapter described four steps: first, the data are discretized using the entropy technique; second, similarity is measured using principal component analysis (PCA); then piecewise aggregate approximation (PAA) and symbolic aggregate approximation (SAX) are used as data representations. In the mining process, decision tree, support vector machine, RBF networks, the JRip algorithm and the Single Conjunctive Rule Learner are used, with the aim of producing high detection accuracy with a low error rate, taking into account the total time taken to build the model.