Data mining is the branch of computer science that deals with extracting valuable information from large volumes of raw data. This research discusses data mining tools: software components and techniques that allow users to extract information from data. Such tools can handle large quantities of data. Common application fields such as marketing and fraud protection already rely on data mining tools. The research examines data mining tools that integrate genetic algorithms.
It considers how genetic algorithms and fuzzy systems help to solve a wide range of problems that are difficult to solve with classical approaches. The research discusses the main kinds of data mining problems: classification, clustering, pattern mining and regression. It goes through these problems to find out which tool is able to address all of them and provide suitable results. The research also reviews the views of different authors.
The research presents a non-commercial Java software tool named DSTA (Data Mining Software Tool integrating Genetic Algorithms). DSTA is introduced as a tool to assess evolutionary algorithms for data mining problems. It also includes a large collection of genetic fuzzy system algorithms based on different approaches.
The aim of the research is to study the use of genetic fuzzy algorithms in order to introduce an effective data mining tool that can solve as many data mining problems as possible.
The project presents a non-commercial Java software tool named DSTA (Data Mining Software Tool using Genetic Algorithms). This tool enables the user to assess the behaviour of evolutionary algorithms for different kinds of data mining (DM) problems:
Classification: a DM method used to predict the group membership of records. Most classification methods are based on decision trees or neural networks. For example, classification can be used to predict whether the weather on a given day will be sunny, rainy or cloudy.
Clustering: another data mining technique, in which a large amount of data is grouped into smaller groups of related data.
Pattern mining: used to discover existing patterns in data. For example, a supermarket sells hundreds of items in a day. Pattern mining helps to find out how often customers buy the same items together; a rule such as "coke => chips (80%)" means that four out of five customers who bought coke also bought chips.
Regression: used to predict numeric values. It is mostly used for prediction and forecasting.
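The pattern mining example above can be made concrete with a short sketch. It computes the confidence of a rule such as "coke => chips", that is, the share of baskets containing coke that also contain chips. The basket data and the function name rule_confidence are hypothetical and chosen only for illustration; this is not how DSTA itself computes rules.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent over a list of baskets."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every item of the antecedent.
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    # Of those, the baskets that also contain every item of the consequent.
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical baskets: five contain coke, and four of those also contain chips.
baskets = [
    {"coke", "chips"},
    {"coke", "chips", "bread"},
    {"coke", "chips"},
    {"coke", "chips", "milk"},
    {"coke", "bread"},
    {"milk", "bread"},
]

confidence = rule_confidence(baskets, {"coke"}, {"chips"})
print(f"coke => chips ({confidence:.0%})")
```

With these baskets, four of the five coke purchases also include chips, which matches the "four out of five customers" reading of the rule.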
The software tool provides a user-friendly graphical interface in which experiments containing multiple datasets and algorithms connected to one another can be designed easily. A further purpose of the research is to show that DSTA is both an educational and a research tool, combining evolutionary learning methods with different pre-processing methods.
Before discussing data mining software tools, it is necessary to understand data mining tools in general.
1.3. What are Data Mining Tools?
According to John Silltow (2006), data mining is the discovery of unexpected patterns, hidden data and new rules in large databases. DM is a powerful technology that helps companies to extract vital information from their databases. Traditional statistical techniques are unable to handle large databases. Data mining tools predict future trends, allowing businesses to make informed decisions. They are used to solve real-world problems in business, engineering and science. Many companies already profit from sorting and processing massive quantities of data.
Development step / year | Enabling technologies | Product providers | Characteristics
(not given) | Computers, tapes, disks | (not given) | Traditional, static information delivery
(not given) | Relational databases (RDBMS), SQL | Oracle, Sybase, Informix, IBM, Microsoft | Traditional, dynamic information delivery at record level
Data Warehousing & Decision Support | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, MicroStrategy | Traditional, dynamic information delivery at multiple levels
(not given) | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups | Approaching, practical information delivery
Table 1. Steps in the development of data mining.
1.3.1. The best techniques in Data Mining are:
Artificial Neural Networks: An ANN is also known as a neural network or neural net. It learns from training data and resembles a biological neural network in structure.
Decision Trees: Tree-shaped structures that represent sets of decisions. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
Genetic Algorithms: Optimization techniques that use processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution.
Rule Induction: The objective of rule induction is to extract useful information from a database in the form of rules.
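To illustrate the genetic algorithm entry in the list above, the following is a minimal sketch of the three processes it names: selection, combination (crossover) and mutation. It is applied to a toy problem (maximizing the number of 1-bits in a string) that is assumed here purely for illustration. All function names and parameter values are hypothetical, and the sketch is not taken from any of the tools discussed.

```python
import random


def one_max(bits):
    """Toy fitness function: the number of 1-bits in the chromosome."""
    return sum(bits)


def tournament(population, fitness, rng, k=3):
    """Selection: pick k individuals at random and keep the fittest."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Combination: one-point crossover of two parent chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rng, rate=0.02):
    """Mutation: flip each bit independently with a small probability."""
    return [1 - b if rng.random() < rate else b for b in bits]


def run_ga(n_bits=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: the best individual survives unchanged.
        children = [max(population, key=one_max)]
        while len(children) < pop_size:
            parent_a = tournament(population, one_max, rng)
            parent_b = tournament(population, one_max, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = children
    return max(population, key=one_max)


best = run_ga()
print(one_max(best), "of", len(best), "bits set")
```

Because the best individual is copied into each new generation, the best fitness never decreases from one generation to the next.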
1.3.2. Data Mining Tools:
Several data mining tools are available on the market, each with its own strengths and drawbacks. They can be divided into three types:
Retrospective Data Mining Tools: These tools help businesses to establish data patterns and trends by using a number of complex algorithms and techniques. Tools of this kind are available in Windows and UNIX versions and are used to monitor data and highlight trends.
Dashboards: These tools are installed on computers to monitor data in a database. They reflect data changes and updates, helping users to see how their business is performing in the form of tables and charts.
Text Mining Tools: The third category of data mining tools. They are able to mine data from different kinds of text, from MS Word and PDF files to plain text files.
The following discusses one data mining technique in more detail, the genetic algorithm:
Herrera (2008) discusses fuzzy systems and genetic algorithms. Fuzzy systems are one of the most important areas of application of fuzzy set theory. A major class of them is the fuzzy rule-based system (FRBS), which is an extension of the classical rule-based system.
Genetic algorithms (GAs) are among the most popular and widely used global search methods, with the ability to explore a huge search space for suitable solutions. Features such as their generic code structure and independent performance measures have extended the use of GAs to the development of a wide range of methods, including the design of FRBSs, over the last few years.
Although genetic fuzzy systems are powerful for solving a wide variety of scientific problems, using them requires solid programming knowledge, and considerable time and effort go into writing a computer program that implements the often sophisticated algorithm according to the user's needs. This work can be tedious and has to be done before users can start focusing on the problem they should really be working on. In the last few years, many fuzzy logic software tools have been developed to reduce this effort.
2. Literature Review
2.1. Literature Review:
To understand the problems in data mining and their solutions, and so to develop a software tool, the research considers the points of view of different researchers.
Charles X. Ling and Chenghui Li (1998, pp. 1-7), in "Data Mining for Direct Marketing: Problems and Solutions", discuss two approaches to promotion and advertising: mass marketing and direct marketing. They favour direct marketing because data mining is very useful for it. They state two problems: first, the class distribution is imbalanced; second, even when a reasonable model is built, predictive accuracy is not a suitable evaluation criterion for the data mining process. They suggest a number of learning algorithms to solve these problems.
Hongjun Lu et al. (1996, pp. 957-961) discuss one of the major problems of data mining, classification. Their methodology finds classification rules with the help of neural networks. The researchers show that the data mining method comprises three phases:
Network Construction and Training: The first phase constructs and trains a three-layer neural network.
Network Pruning: This phase removes redundant links and units without increasing the classification error rate of the network.
Rule Extraction: The third phase extracts the classification rules from the pruned network.
They also present experimental results that demonstrate the effectiveness of the proposed approach. Their work focuses on one important method, rule extraction.
John F. Elder and Dean W. Abbott (1998, pp. 1-31), in "A Comparison of Leading Data Mining Tools", evaluated several important commercial data mining tools and concluded that Data Cruncher, PRW and OLPARS were the strongest tools, whereas Clementine and Darwin were rated as average. A comparison of the algorithms was also carried out, in which decision tree algorithms appeared most often, with linear and statistical algorithms a close second. They also considered some of the open source data mining tools available on the market.
Jesus Alcala-Fdez et al. (2008, pp. 83-88) present a software tool known as KEEL, which helps to solve various data mining problems. They show how KEEL evaluates evolutionary algorithms for data mining problems and give a step-by-step procedure for using the tool. They also discuss case studies of the on-line and off-line modules, and compare several non-commercial data mining software tools so that users can easily understand their advantages and drawbacks.
Heikki Mannila (2001, pp. 1-15), in his article "Methods and Problems in Data Mining", discusses knowledge discovery in databases and how to find interesting or important information in huge amounts of data. He identifies numerous open research problems in data mining.
F. Herrera (2008, pp. 27-46) states that genetic fuzzy systems (GFSs) arise from the hybridization of fuzzy logic and genetic algorithms. A GFS is a fuzzy system augmented by a learning process based on evolutionary computation, which can include genetic algorithms, genetic programming and other evolutionary algorithms. Fuzzy systems are one of the most important areas of application of fuzzy set theory; they usually take the form of a model structure known as a fuzzy rule-based system.
Oscar Cordon (2007, pp. 1-166), "Genetic Fuzzy Systems: Fuzzy Knowledge Extraction by Evolutionary Algorithms". This work is entirely devoted to genetic fuzzy systems and their approaches. Genetic algorithms are used to design fuzzy systems, and the resulting genetic fuzzy systems are an example of soft computing. The most popular kind of fuzzy system used in GFSs is the fuzzy rule-based system (FRBS). The work also includes diagrams and flowcharts that illustrate how GFSs work.
Dean W. Abbott, I. Phillip Matkovsky and John F. Elder (2002, pp. 1-6) discuss, in their paper "An Evaluation of High-End Data Mining Tools for Fraud Detection", how data mining tools are used to solve real-world problems in fields such as engineering, science and business. They describe the recent widespread deployment of data mining tools for fraud detection and set out the tool selection process and the products evaluated, which are as follows:
Enterprise Miner (EM).
Intelligent Miner for Data (IM).
Pattern Recognition Workbench (PRW).
The research compares the tools to show how they differ from one another and presents their advantages and disadvantages for fraud detection. It also covers the hardware and software compatibility of each product and discusses some of the data mining algorithms these products use.
2.2. Existing System:
SPSS is an advanced DM toolkit. It allows users to carry out their own data mining. It is used in a range of technical disciplines. It consists of two parts: a statistical platform and the SPSS language. SPSS works with three basic file types: data, syntax and output files. It shows data in a spreadsheet layout.
Drawbacks of SPSS are:
SPSS is a commercial software tool, so it comes at a cost.
Its licence terms are restrictive.
Default graphics are weak and difficult to modify.
It regularly faces compatibility problems with previous editions.
A variety of fuzzy logic tools are available these days, such as MATLAB with its Fuzzy Logic Toolbox. MATLAB stands for Matrix Laboratory. It provides a numerical computing environment and a fourth-generation programming language, and it is developed by MathWorks. It supports matrix operations, plotting of data, implementation of algorithms and creation of user interfaces. The major drawback of these tools is that they require a lot of complex programming and expert users to build and use them according to the users' needs. Processing data can also take some time before output is produced.
MATLAB has two major drawbacks:
MATLAB is an interpreted language, which means it runs more slowly than compiled languages.
Cost: a complete version of MATLAB is considerably more expensive than a conventional C or FORTRAN compiler.
2.3. Proposed system:
Here a non-commercial Java-based software tool named DSTA (Data Mining Software Tool using Genetic Algorithms) is presented. It enables the user to assess the behaviour of evolutionary algorithms for different kinds of data mining (DM) problems such as regression, classification, clustering and pattern mining.
DSTA is a software tool developed to build and use different data mining models. It is a Java tool containing an open source Java library of evolutionary learning algorithms.
Advantages: DSTA offers the following benefits:
Less programming effort: DSTA has a large collection of genetic fuzzy system algorithms based on different paradigms, integrated with different pre-processing methods.
Researchers with little programming knowledge can apply these algorithms to their problems effectively.
The software tool can run on any computer with Java, so it is platform independent.
2.3.1. Selected Software:
JAVA (JDK 1.6):
Java is one of the most popular languages in use today. One of its main features is platform independence.
Other features of Java include:
Java is a programmer's language.
The Java language is cohesive and consistent.
Java gives the programmer full control and provides better security features than many other programming languages.
Java is an efficient Internet programming language.
Sebastian Ventura et al. (2007, pp. 381-392), "JCLEC: a Java framework for Evolutionary Computation"
This paper presents the Java Class Library for Evolutionary Computation (JCLEC), a software system for evolutionary computation research that provides high-level support for any kind of evolutionary algorithm, including genetic algorithms, genetic programming and evolutionary programming.
JCLEC follows strong principles of object-oriented programming, in which objects are loosely coupled and code is common and easy to reuse.
It provides an efficient, generic and robust environment for working with different genetic algorithms.
Generic: With JCLEC, users can run almost any kind of evolutionary computation, provided it meets certain basic requirements. It supports many evolutionary flavours, such as genetic programming and bit-string and real-valued vector genetic algorithms. Another notable feature is its support for advanced evolutionary computation techniques such as multi-objective optimization.
User Friendly: JCLEC is easy to use. It offers a user-friendly interface together with a high-level programming environment.
Portable: It is highly portable and can be used on any platform that supports Java.
Efficient: It has critical code sections that provide an efficient execution platform.
Robust: Verification and validation statements are embedded in the code to ensure correct operation and to inform the user when there is a problem.
Free Source: The source code of JCLEC is open and is released under the GNU General Public License (GPL). It can therefore be distributed and modified at no cost.
Fig 1. Three layers comprise the JCLEC architecture
Source: Sebastian Ventura et al. (2007, p. 384), JCLEC: a Java framework for Evolutionary Computation
The lowest layer is the system core. It contains the definition of the abstract types, their base implementations and some software modules that provide all the functionality the system requires. On top of the core layer is the experiments runner system, which executes a sequence of evolutionary algorithms defined by means of a structured file. It takes this file as input and returns one or several reports on the algorithm executions. The top layer is a graphical user interface (GUI) for evolutionary computation called GenLab. It helps to solve problems more easily using the evolutionary algorithms available in the system. It configures the algorithms and then executes them interactively, generating on-line information about the evolutionary process. Users can include their own code, provided that it follows the hierarchy defined in the system core.
3. Introducing (DSTA)
Data Mining Software Tools Integrating Genetic Algorithms.
Following J. Alcala-Fdez et al. (2008, pp. 1-12), the research reviews several data mining software tools in order to explain the advantages of DSTA. A wide range of DM software tools exists. They can first be sorted by licence type into commercial tools (SPSS Clementine, Oracle DM, Knowledge STUDIO) and non-commercial tools. The research then discusses open source tools, which play a major role in developing new evolutionary algorithms for particular uses and in integrating groups of data mining and learning algorithms. Researchers increasingly turn to such tools to solve their problems; the most popular open source data mining platform is Weka.
Non-commercial data mining software tools include:
ADAM: A collection of free modules designed to run in grid and cluster environments. The toolkit offers components for tasks such as image processing and data cleaning.
D2K: Also known as the Data to Knowledge toolkit, it can be accessed through a Java programming environment. The toolkit combines with external platforms to perform image and text mining. Data to Knowledge also offers an external set of evolutionary mechanisms designed for developing genetic algorithms.
Weka: One of the best-known open source environments for machine learning and data mining; the name stands for Waikato Environment for Knowledge Analysis. It can be accessed through Java programming, through a command-line interface or through a GUI. It provides tools for data pre-processing, classification, regression, clustering and visualization.
Tanagra: A data mining software tool designed for research and education. It includes many machine learning methods, data exploration techniques and statistical learning methods.
Many other software tools are not mentioned above; each has its own features and strengths.
Features of data mining software tools: the research examines the following characteristics of data mining software tools.
Languages: Open source data mining software tools are written in programming languages such as Java and C++; Java is generally considered easier to handle than C++.
GUI: A graphical user interface provides a user-friendly environment. It includes the following characteristics:
Data Visualization: Presents data sets by means of charts, tables and so on.
Data Management: Covers tasks such as deleting and altering data.
Graph Representation: Shows the flow of data or information as a tree structure with parent and child connections.
Input/Output: This feature covers the different data formats supported.
Pre-processing: Pyle (1999) describes data pre-processing as one of the important steps of data mining and highlights the following processes:
Data cleaning: This is one of the major problems in data mining, as the data to be mined is often full of unexpected and useless values that are of no interest. This step involves filling in missing values, correcting inconsistent data, smoothing out noisy data and so on.
Data integration: This step involves combining data from various sources, identifying real-world entities across multiple data sources, and removing duplicate and redundant data.
Data transformation: This step involves removing noise from the data and scaling the data to fall within a small specified range. It also helps with summarization and generalization of data.
Data reduction: A process in which a large data set that is hard to handle is reduced to a smaller subset that still produces the same, or nearly the same, analytical results. It is usually done through dimensionality reduction, aggregation and clustering; sampling is sometimes used as well.
Data Discretization: In this process, the range of a continuous attribute is divided into intervals. Major techniques include binning methods and entropy-based methods.
Learning Category: Covers the main areas of data mining supported, such as predictive tasks (classification, regression) and descriptive tasks (clustering).
Off/On-line: The way experiments are run. On-line experiments run inside the software tool, whereas off-line experiments run independently on any other machine and do not require the software tool.
Advanced Features: These are as follows:
Post-Processing: Commonly used to refine the model learned by an algorithm.
Meta-Learning: Includes more advanced learning schemes such as bagging and boosting.
Evolutionary Algorithms: Indicates the role of genetic algorithms within the tool.
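Two of the pre-processing steps listed above, filling in missing values (data cleaning) and scaling values into a small specified range (data transformation), can be sketched as follows. Mean imputation and min-max scaling are chosen here as simple, common examples; the source does not say which specific methods the tools use, and the function names and sample data are hypothetical.

```python
def impute_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones.

    Assumes that at least one value is present.
    """
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Data transformation: rescale values linearly into the range [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant column carries no spread; map every value to the lower bound.
        return [low for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


raw = [2.0, None, 4.0, 6.0]      # hypothetical attribute with one missing value
cleaned = impute_mean(raw)       # [2.0, 4.0, 4.0, 6.0]
scaled = min_max_scale(cleaned)  # [0.0, 0.5, 0.5, 1.0]
print(cleaned, scaled)
```

Imputing before scaling matters in this sketch, because the minimum and maximum cannot be computed while missing values are still present.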
Table 2 shows the features of the software tools. Some tools offer no support or only basic support for pre-processing and statistical tests.
[Table 2: cell values not recoverable from the source. Surviving feature labels: ARFF data format; other data formats; database connection; missing values imputation.]
Legend: N: none; Y: supported; B: basic support; I: intermediate support; A: advanced support.
Table 2. Features of DM software tools.
Source: J. Alcala-Fdez et al. (2008, p. 4)
Having studied the above software tools, the research considers the user's requirements: users want to study the performance of evolutionary and non-evolutionary algorithms for different kinds of learning and pre-processing tasks, and to run experiments both off-line and on-line. In response to these needs the research introduces DSTA (Data Mining Software Tool using Genetic Algorithms).
3.2 Introduction of DSTA
The research introduces a non-commercial Java software tool named DSTA (Data Mining Software Tool using Genetic Algorithms). DSTA is a Java software tool for assessing evolutionary algorithms on data mining problems such as classification, clustering, regression and pattern mining. The current version allows a complete and thorough analysis of any learning model in comparison with existing ones, including a statistical test module. It includes features suitable for both research and educational purposes.
DSTA as a Research Tool: For researchers, the main use of DSTA is to automate experiments and to evaluate the results on a large scale.
DSTA as an Educational Tool: The needs of students are quite different from those of researchers. An educational tool does not need to repeat the same experiment many times. If the tool is used in class, execution time must be short, and a real-time view of the evolution of the algorithms is needed so that students can learn how to adjust the parameters of the algorithms.
DSTA offers numerous benefits:
It contains a large library of evolutionary algorithms based on different paradigms such as Pittsburgh and Michigan. Their integration with different pre-processing methods also makes them easier to use.
It extends the range of possible users who can apply evolutionary algorithms.
The software can be used on any system with Java.
Before discussing the tool itself, it is necessary to understand how genetic algorithms work in DSTA.
3.2.1. Genetic Algorithms in DSTA:
Genetic fuzzy systems are among the most common such structures nowadays. Genetic algorithms offer a powerful mechanism to encode and evolve rule antecedent aggregation operators, different rule semantics and defuzzification methods. Genetic algorithms are currently among the most powerful knowledge acquisition schemes, capable of designing and, in some sense, optimizing FRBSs with regard to the design decisions.
The research uses the genetic fuzzy system methodology in two ways, tuning and learning, which work as follows:
Genetic tuning of scaling functions: Scaling functions are applied to the input and output variables of an FRBS and normalize the universes of discourse in which the fuzzy membership functions are defined. From a knowledge engineering point of view, they carry context information, so that relative semantics can be translated into absolute ones.
Given an existing knowledge base, a genetic tuning process is then applied to improve and refine the performance of the fuzzy rule-based system.
Genetic learning of rules applies only to descriptive FRBSs, since in the approximate approach adapting rules amounts to modifying the membership functions.
An example of a GA for one of these methods is:
Genetic tuning of knowledge base parameters: A tuning method based on a specific genetic algorithm is used to find high-performance fuzzy control rules. It deals with disjoint search spaces. It uses genetic representations such as multi-chromosome genomes. A genetic FRBS that encodes single rules rather than whole knowledge bases (KBs) is an important way of finding flexible, complex rules while the representation remains cost-effective and flexible.
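The idea of tuning membership function parameters by evolutionary search can be illustrated with a deliberately simplified sketch. It adjusts the three parameters of a single triangular membership function so that it fits a set of target membership degrees. Several things here are assumptions made for illustration and are not taken from the source: the triangular shape, the squared-error fitness, the sample data, and the use of a mutation-only evolutionary loop with truncation selection in place of a full genetic algorithm with crossover.

```python
import random


def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def error(params, samples):
    """Squared error between the membership function and the target degrees."""
    a, b, c = sorted(params)  # sorting keeps a <= b <= c for any chromosome
    return sum((triangular(x, a, b, c) - y) ** 2 for x, y in samples)


def tune(samples, start, generations=200, pop_size=20, sigma=0.3, seed=1):
    """Evolve (a, b, c) by truncation selection and Gaussian mutation."""
    rng = random.Random(seed)
    population = [list(start)] + [
        [p + rng.gauss(0, sigma) for p in start] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=lambda p: error(p, samples))
        parents = population[: pop_size // 2]  # the better half survives unchanged
        children = [
            [gene + rng.gauss(0, sigma) for gene in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return sorted(min(population, key=lambda p: error(p, samples)))


# Hypothetical target: degrees produced by a "true" fuzzy set (2, 5, 8).
samples = [(x, triangular(x, 2.0, 5.0, 8.0)) for x in range(0, 11)]
start = [1.0, 4.0, 9.0]  # initial, poorly placed membership function
tuned = tune(samples, start)
print(tuned, error(tuned, samples))
```

Because the better half of the population always survives, the tuned parameters can never fit the samples worse than the starting definition did.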
3.2.2. DSTA integrates three main blocks:
Data Management Module: This module consists of a set of tools that can be used as follows:
To create new data.
To export and import data in different formats, as required.
It is responsible for data visualization and deletion.
To apply transformations and partitioning to data.
Datasets in .dat format often fail to run in experiments and produce errors. To remove this problem, the user can convert the old file into a new one using these tools.
Last but not least is partitioning, which divides a complete dataset into pairs of training and test files.
Design of Experiments Module: A graphical user interface that allows experiments to be designed for solving different machine learning problems. Once an experiment is designed, the module generates the directory structure and files required to run it on any local machine with Java. It is also known as the off-line module. The first step in the experiments module is to pick one of the following partition types:
K-fold cross validation
5x2 cross validation
Next, select the type of experiment: classification, regression or unsupervised learning.
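The two partition types named above can be sketched as follows. The code splits row indices into k folds, each of which serves once as the test set, and builds 5x2 cross validation as five repetitions of a 2-fold split with different shuffles. The function names are hypothetical and the sketch does not reproduce how DSTA writes its training and test files.

```python
import random


def k_fold_splits(n_rows, k, seed=0):
    """Return k (train, test) pairs of row indices for k-fold cross validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def five_by_two_splits(n_rows):
    """5x2 cross validation: five repetitions of a 2-fold split, each reshuffled."""
    return [k_fold_splits(n_rows, 2, seed=repetition) for repetition in range(5)]


splits = k_fold_splits(10, 5)
repeated = five_by_two_splits(10)
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Every row appears in exactly one test fold per k-fold run, so each record is used for testing once and for training k - 1 times.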
List of Algorithms used in Experiments Module: Some of the algorithms used in the data mining tool are described below.
Pre-processing algorithms: This section includes discretizers, feature selection, instance selection, transformation and missing values. In short, these algorithms filter useless values out of the raw data to make it useful and easy to manage and mine.
Fig 2. Preprocessing Algorithms
These algorithms have many variants, which are too numerous to explain in full, so the research explains only one of them to show how they work.
a) Discretizers:
Table 3. Disc-UniformWidth: it is used as follows:
Applies the Uniform Width Discretizer algorithm to convert a set of numerical variables into nominal variables.
Triggered by a request for a data discretization pre-processing task.
For pre-processing of classification data sets only.
The data set must have at least one nominal output.
Produces the set of discretized instances.
Produces a file listing the cut points used in the discretization.
The steps in the data management tool are as follows:
Open the DSTA application.
Click on Data Management in the window.
Click the preparation button.
Select the original data set file to modify.
Select the data sets directory in which to save the result.
Select Discretization as the modification to apply.
Then click on pre-processing and select Disc-UniformWidth.
Click on parameters to alter the parameters of the algorithm.
Finally, click on Transform to run the algorithm.
The steps that users follow in the experiments tool are as follows:
Open the DSTA application.
Click Experiments in the main window.
Select the type of partition and then select the classification button.
Select the data set on which to use the algorithm and click in the experiment desk.
Click on the pre-process algorithm button.
Select Disc-UniformWidth in the algorithms panel: Algorithms > Discretizers > Disc-UniformWidth. Then click in the experiment desk.
Select the right pointer in the tool panel, then the seventh button in the vertical panel, and connect the dataset node to the discretizer node in the experiment desk.
Click on the blue triangle button in the toolbar and save the experiment as a zipped file.
Run the experiment by unzipping the file and executing the following command in the scripts folder:
java -jar RunKeel.jar
Frequency of use: depends on the user.
Notes and Issues
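The uniform width discretization described in Table 3 can be sketched in a few lines. The value range of a numerical attribute is divided into intervals of equal width, the interior boundaries form the cut points, and each value is replaced by the index of its interval. This is a generic sketch of the technique, not the tool's own implementation; the function names and the sample values are hypothetical.

```python
from bisect import bisect_right


def uniform_width_cut_points(values, n_bins):
    """Cut points that split [min, max] into n_bins intervals of equal width."""
    smallest, largest = min(values), max(values)
    width = (largest - smallest) / n_bins
    return [smallest + width * i for i in range(1, n_bins)]


def discretize(value, cut_points):
    """Index of the interval that value falls into (0 for the lowest interval)."""
    return bisect_right(cut_points, value)


values = [1.0, 2.5, 4.0, 5.5, 7.0, 10.0]   # hypothetical numerical attribute
cuts = uniform_width_cut_points(values, 3)  # [4.0, 7.0]
labels = [discretize(v, cuts) for v in values]
print(cuts, labels)
```

In this sketch a value that lies exactly on a cut point is assigned to the upper interval; other conventions are equally possible.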
Classification Algorithms: These include statistical classifiers, decision trees, rule learning, fuzzy rule learning and neural networks.
Regression Algorithms: These include statistical regression, fuzzy rule learning, symbolic regression and neural networks.
Non-Supervised Learning: This includes clustering algorithms, subgroup discovery and association rules.
Statistical Tests: This section includes test analysis for classification and test analysis for regression.
Visualize Results: This shows single or multiple results for regression and classification.
Educational Experiments Module: This module allows the design of experiments that can be run step by step in order to show the learning process of a specific model, using the software tool for educational purposes. Results and analysis are shown in on-line mode.
Fig 3. Design Educational Experiments.
3.2.3. The main features of DSTA tool are:
It contains pre-processing algorithms that perform transformation, discretization and feature selection.
It also contains a full-fledged library of knowledge extraction algorithms, both supervised and unsupervised, notably incorporating multiple evolutionary learning algorithms.
It has a statistical analysis library to analyze different algorithms.
It contains a user-friendly graphical interface, oriented to the analysis of different algorithms as needed.
It provides an environment that can connect to the Internet to download new data files for use in future analysis.
It allows databases to be uploaded into the tool, or accessed through the web.
3.2.4. DSTA vs. Weka D.M. Software Tools:
Input/ Output Variety
Off-line Run Type
Table 4. Comparison between the two software tools
Table 3 makes it clear that the DSTA software tool compares much better than the other software tools.
The design of experiments part allows the user to design the desired experiment through the graphical interface. Once the experiment has been designed, a zip file is generated with the directory structure required to run the experiment on a local computer. This interface also allows users to add their own algorithms to the designed experiments.
The tool generates its evolutionary algorithms with the help of the JCLEC library. This allows users to create their own evolutionary algorithms using the available graphical interface.
Let's consider how to implement DSTA:
Datasets can be converted from several formats to the DSTA format.
The tool allows the user to evaluate the performance of evolutionary algorithms on distinct data mining problems.
Data mining problems such as classification, regression and so on can be addressed.
Performing research on various datasets is very simple.
Experiments can be followed step by step in the educational module.
DSTA provides the following benefits to users:
Users can work with several data formats.
DSTA provides several options for selecting the experiment.
Users can select the kinds of algorithms that suit their data.
Users can follow the progress of their experiment step by step.
Users can save the output in the required directory.
Users can clearly see how DSTA works.
To implement DSTA, the report considers two examples:
Off-line Experiment and
On-line Experiment.
DSTA supports three modules: Data Management, Design of Experiments and the Educational module. Through evolutionary algorithms, the tool addresses data mining problems including classification, regression and unsupervised learning. Let's consider the first example.
3.3.1. Off-line Experiment:
The report compares two fuzzy rule learning algorithms, Class-Fuzzy-Slave and Class-Fuzzy-Chi-RW. DSTA has pre-defined datasets, or users can create their own datasets according to their requirements. The following are the 12 pre-defined dataset problems for classification:
The research selects one of the pre-defined datasets. Before selecting the dataset, the user must select the type of partition and the type of experiment. The example selects a k-fold partition and a classification problem. The experiment runs with 10-fold cross-validation, which means the data is divided into ten pairs of training and test files.
Example: if the Class-Fuzzy-Slave algorithm is configured with five runs, its total number of runs is 5 x 10 = 50.
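The arithmetic above can be illustrated with a short Python sketch. It splits a dataset into ten training/test pairs and multiplies the number of partitions by the number of runs per partition. This is a plain, non-stratified split written for illustration; how DSTA builds its partitions internally is not described in this report. The function name k_fold_indices is hypothetical, and the figure of 178 instances for the Wine dataset is stated here as an assumption.

```python
def k_fold_indices(n_instances, k):
    """Split instance indices 0..n-1 into k (train, test) index pairs."""
    # Fold i receives every k-th index starting from i.
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    partitions = []
    for i in range(k):
        test = folds[i]
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        partitions.append((train, test))
    return partitions


n_wine_instances = 178   # assumed size of the Wine dataset
partitions = k_fold_indices(n_wine_instances, 10)

runs_per_partition = 5   # the "five runs" from the example above
total_runs = runs_per_partition * len(partitions)
print(len(partitions), total_runs)
```

Each instance appears in exactly one test file and in the nine other training files, which is what makes the ten accuracy figures comparable across partitions.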
The experiment uses the Wine dataset with 10-fold cross-validation; the user can also select other datasets at the same time. Once the data has been partitioned, the experiment contains a set of training and test datasets. The experiment includes:
Dataset Name: Wine
Method 1 Algorithm: Class-Fuzzy-Chi-SLAVE
Method 2 Algorithm: Class-Fuzzy-Chi-RW
Fig 4. Graph Representing Progress of Data.
The first step is to select the datasets, then the learning algorithms, the test analysis and the visualization of the outcome. The nodes can be easily identified by their contrasting colours. In Fig 4, the data is connected to two learning methods: Class-Fuzzy-Chi-SLAVE and Class-Fuzzy-Chi-RW. Both methods are connected to the tabular visualization module, Vis-Class-Tabular, and to the test analysis algorithm, Stat-Class-Wilcoxon. Once the graph has been connected with nodes and arrows (the arrows define the connections between nodes), the last step is to save the experiment. The experiment is saved as a ZIP file or an XML file for the off-line run.
After the experiment is complete, the results for the dataset are recorded for each training and test file. These results are the input for the visualization and test analysis modules. The visualization algorithm, Vis-Class-Tabular, takes these results as input and creates output files containing several performance metrics computed from them. They are as follows:
Confusion matrices for each method.
A final summary of the results.
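As an illustration of the first metric, the following Python sketch builds a confusion matrix from lists of actual and predicted class labels and derives the accuracy from it. It shows the general technique only and is not the Vis-Class-Tabular implementation; the function names and the six sample predictions are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical labels for six test instances of a three-class problem.
labels = ["1", "2", "3"]
actual = ["1", "1", "2", "2", "3", "3"]
predicted = ["1", "2", "2", "2", "3", "1"]

matrix = confusion_matrix(actual, predicted, labels)
for row in matrix:
    print(row)
print(accuracy(matrix))
```

In this made-up example four of the six instances lie on the diagonal, so the accuracy is 4/6. The off-diagonal cells show which classes are confused with each other, which a single accuracy figure cannot show.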
Fig 5. Experiment Created Successfully.
A further type of result comes from Stat-Class-Wilcoxon, which performs a statistical comparison of the two methods. The experiment is generated as an XML file and a JAR package. Experiments are modelled graphically: they represent the links among data, algorithms and testing/visualization modules, and properties such as type of learning, validation, number of runs and algorithm parameters can be easily configured. Once the experiment is created, DSTA produces a script-based program that can be run on any system with a Java Virtual Machine, using the command java -jar RunKeel.jar.
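To show the kind of comparison a Wilcoxon test makes, the sketch below computes the two rank sums of the Wilcoxon signed-rank test for paired accuracy figures, one pair per partition. It is a simplified illustration and not the Stat-Class-Wilcoxon code: it stops at the rank sums and does not compute a p-value. The function name and the accuracy percentages are hypothetical and are not results from this report.

```python
def wilcoxon_rank_sums(a, b):
    """Return (W+, W-) for paired samples a and b.

    Zero differences are dropped and tied absolute differences
    share the average of the ranks they occupy.
    """
    diffs = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the 1-based ranks i+1..j+1
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical test accuracies (in percent) of two methods on six partitions.
method_1 = [90, 85, 88, 92, 80, 87]
method_2 = [93, 85, 90, 91, 86, 92]

w_plus, w_minus = wilcoxon_rank_sums(method_1, method_2)
print(w_plus, w_minus)
```

A small W+ together with a large W- means that the second method wins on most partitions and by the larger margins. Whether the difference is statistically significant would still require comparing the smaller rank sum against a critical value or computing a p-value.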
3.3.2. On-line Experiment:
The on-line experiment runs through the educational module. It follows the same steps already discussed for the off-line experiment. The experiment runs in a separate window, shown in Fig 6. The user can start, stop and pause the experiment at any time in order to observe its progress.
Fig 6. On-line Experiment.
The experiment then examines the results. The results show that the Fuzzy-Chi-SLAVE algorithm does not achieve the best training and test accuracy on the whole dataset.
Table 5. Percentage of success per partition for each method.
Table 5 shows that Class-Fuzzy-Chi-RW is the better method for this classification problem compared with Class-Fuzzy-Slave.
TESTING PHASE IN DSTA
The research has successfully introduced a tool for solving data mining problems such as classification, regression, pattern mining, clustering and so on. Datasets were imported, and the operations performed showed the difference before and after applying the algorithms. However, some restrictions were found in the testing phase of DSTA.
Restrictions: Some restrictions must be considered when making connections between the different methods and datasets, test analysis and visualization. They are as follows:
• A dataset cannot receive inputs.
• The pre-processing algorithms can only receive inputs from a dataset or another pre-processing method.
• A method can receive data from a dataset, from a pre-processing algorithm or from a previous method.
• The test algorithms must receive input data from a method or from a post-processing algorithm.
• Test algorithms cannot receive outputs.
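The restrictions above can be read as a small rule table that says which kind of node may feed which other kind. The Python sketch below encodes that reading; it is not DSTA's own validation code. The node kind names and the function is_valid_connection are hypothetical. Two interpretations are assumptions: the third restriction is taken to apply to learning methods, and the last restriction is taken to mean that a test node has no outgoing connections.

```python
# For each target node kind, the source kinds it may receive input from.
ALLOWED_SOURCES = {
    "dataset": set(),                                # a dataset cannot receive inputs
    "preprocess": {"dataset", "preprocess"},
    "method": {"dataset", "preprocess", "method"},   # assumption: rule 3 is about methods
    "test": {"method", "postprocess"},
}


def is_valid_connection(source_kind, target_kind):
    """Check one arrow of the experiment graph against the rule table."""
    if source_kind == "test":
        # Assumed reading of the last restriction: tests feed nothing.
        return False
    allowed = ALLOWED_SOURCES.get(target_kind)
    if allowed is None:
        # Target kinds not covered by the listed restrictions are rejected here.
        return False
    return source_kind in allowed


print(is_valid_connection("dataset", "preprocess"))
print(is_valid_connection("dataset", "test"))
```

Keeping the rules in a table rather than in nested conditions makes it easy to compare the code with the list of restrictions line by line.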
Fig 7. Testing Results
3.4.1. Sample Test Case for System:
Test case ID: DSTA-tc-01
Input: Relation name blank and attribute blank
Description: Relation name and attribute fields are mandatory
Test case ID: DSTA-tc-02
Input: Relation name given and attribute blank
Description: Attribute field is mandatory
Test case ID: DSTA-tc-03
Input: Relation name blank and attribute given
Description: Relation name field is mandatory
Test case ID: DSTA-tc-04
Input: Relation name given and attribute given
Description: Dataset created successfully
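The four test cases can also be expressed as an automated check. The sketch below is a hypothetical stand-in for the dataset creation form, written only to show how the mandatory-field rule could be tested; the function validate_new_dataset and the sample relation and attribute names are assumptions and not part of DSTA.

```python
def validate_new_dataset(relation_name, attributes):
    """Return the list of validation messages; an empty list means success."""
    errors = []
    if not relation_name.strip():
        errors.append("Relation name field is mandatory")
    # The attribute field counts as blank if no non-empty attribute is given.
    if not any(attribute.strip() for attribute in attributes):
        errors.append("Attribute field is mandatory")
    return errors


# One entry per test case: (ID, relation name, attributes).
cases = [
    ("DSTA-tc-01", "", []),
    ("DSTA-tc-02", "wine", []),
    ("DSTA-tc-03", "", ["alcohol"]),
    ("DSTA-tc-04", "wine", ["alcohol"]),
]
for case_id, relation_name, attributes in cases:
    errors = validate_new_dataset(relation_name, attributes)
    print(case_id, errors if errors else "Dataset created successfully")
```

Returning all messages at once, rather than stopping at the first failure, matches test case DSTA-tc-01, where both fields are reported as mandatory.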
In this work, the research described a non-commercial Java software tool known as DSTA (Data Mining Software Tool using Genetic Algorithms), a software tool to assess evolutionary algorithms for data mining problems, paying special attention to the genetic fuzzy system algorithms integrated in the tool.
The research for this project is based on a study of a variety of issues drawn from the available literature and from previous work by other researchers. The research presents some D.M. software tools and compares them in order to explain DSTA more clearly. It also discusses the step-by-step implementation methods and describes how users can benefit from the tool.
The tool allows researchers to focus on the analysis of their new genetic fuzzy system algorithms and relieves them of heavy programming work. Moreover, the tool can be used by anyone with limited knowledge of genetic algorithms to build their own systems.
This software tool is being continuously updated and improved. The research is introducing a new set of test tools.
5. Critical Evaluation.
Data mining is one of the fastest developing fields in computer science, so the development of a data mining tool is an advantage. There are many data mining tools on the market, but if an open-source tool is introduced, it can also be used by commercial software tools. The literature discussed so far gives a proper understanding of which data mining tools are available and of their advantages and disadvantages.
A data mining tool using different genetic algorithms has been successfully introduced. The tool integrates different genetic algorithms and performed the data mining operations tested in this report accurately and efficiently.
One of the tool's major features is faster execution of complex data mining queries compared with using ordinary SQL to obtain the output for those queries.
During the analysis, the report was able to insert, modify and alter data and datasets, and to mine the data effectively with satisfactory results; the screenshots are evidence of this analysis. The report also compares two fuzzy rule methods and identifies the better one. A key strength of the tool is that the user can work with a variety of file formats.
Example: for some complex queries, the tool returned answers almost 8 times faster than an ordinary SQL query would. For the Wine database, the query describing the percentage of success in each partition ran almost 5-6 times faster than an ordinary SQL statement.
It was also observed that the results obtained were almost constant and steady. Another important feature of the tool is that users can integrate other data mining algorithms as needed. The tool therefore successfully implements and processes complex data mining queries and serves the purpose it was designed for.