Multi-Relational Data Pre-Processing Techniques Computer Science Essay

This project aims to develop a prototype application that pre-processes multi-relational data and builds a data model for data mining. It analyses past approaches to multi-relational data mining and, based on these approaches, designs a prototype application that can cope with real-world multi-relational databases. The prototype follows the spiral development model and is developed in Microsoft Visual Studio 2008 with C# and SQL Server 2008. Currently it works with any SQL Server database and builds a data model suitable for any kind of data mining tool.

CHAPTER 1: INTRODUCTION

1.1 Summary

This chapter sets the framework for the entire project. The aim of the dissertation is introduced briefly at first, then the methodology used for the development of this project is described, and finally a brief overview of the dissertation is given.

1.2 Introduction


Recent advances in high-throughput data acquisition, digital storage, and communication technologies have made it possible to gather very large amounts of data in many scientific and commercial domains, much of which resides in relational databases. However, many data mining techniques were not designed for relational databases, and those that were typically targeted Prolog data using various ILP approaches. This project is concerned with pre-processing techniques for multi-relational data mining and with building a data model from a multi-relational database for the data mining process.

1.3 Aims and Objectives

This prototype system is designed to build a data model according to the user's requirements, to which the user can then apply different data mining approaches. Its objectives are to:

Allow the user to select the database type.

Allow the user to select a database.

Allow the user to select a table for building the data model.

After selection of the target table, find all its related tables via the target table's primary and foreign keys.

Analyse all tables and their columns.

Build a model, give the user the ability to remove unnecessary values, and export it to the required format.

1.4 The Structure of the Dissertation

The structure of the dissertation reflects the basic steps of the methods selected to produce this prototype application. At the beginning of each chapter there is a brief summary outlining the structure of the chapter as well as the work detailed in each section. The whole dissertation is divided into six chapters.

The outline of each chapter is described below:

The second chapter will focus on background and literature survey. General concepts in data mining and different approaches are described in this chapter.

The third chapter gives a detailed review of the methodologies and technologies used in this project, such as Microsoft Visual Studio 2008, SQL Server 2008 and C#.

The fourth chapter is concerned with the architecture, its different layers, and the physical implementation of the application.

The fifth chapter illustrates some tests that are undertaken in order to ensure that the proposed application has been built correctly and meets its requirements.

The final, sixth chapter summarizes the whole project's objectives and suggests possible future work.

CHAPTER 2: BACKGROUND & LITERATURE REVIEW

___________________________________________________________________

2.1 Summary

This chapter discusses general concepts and pre-processing techniques for multi-relational data mining, including an overview of data mining, research on multi-relational data mining, and the different kinds of techniques that have been proposed.

2.2 Overview of Data Mining

Data mining is not only about collecting and managing data; it involves extracting hidden predictive information and knowledge from large databases [1]. Data mining can be performed on different kinds of data, such as quantitative, textual and multimedia data. This powerful technology can help companies focus on the most important information in their data warehouses, which can help them predict future trends and behaviours of businesses.

The objective of data mining research is to develop more efficient solutions for discovering interesting and previously unknown knowledge from large databases [2; 3].

As [1; 4] mention, data mining is the central and essential process in Knowledge Discovery in Databases (KDD), and [5] describes the KDD process as consisting of three subtasks. The first task transforms the original data into the input format of the data mining algorithm (pre-processing the data). After this data formation, one or more algorithms must be applied to extract interesting patterns, regularities or general laws from the data; this phase is called the proper data mining step. The last phase is the post-processing of results, in which the results obtained by the data mining process may need to be translated into a more intelligible format.


Data mining algorithms can be classified into two broad categories: predictive or descriptive. Predictive data mining is the process of automatically creating a classification model from given examples; the most popular example of a predictive technique is decision tree induction (e.g., Quinlan's C5) [5]. Descriptive data mining, on the other hand, characterizes data by finding patterns and regularities in the given examples. Descriptive data mining can be classified into clustering, association and sequential analysis [7]; the discovery of association rules is an example mentioned in [8].

2.3 Multi Relational Data Mining Approaches

In this section we discuss some approaches which have been proposed in the past.

2.3.1 Inductive Logic Programming

ILP is situated at the intersection of two important areas of Computer Science. One is induction, the main technique used in several machine learning algorithms to produce models that generalize beyond specific instances; the other is logic programming, the programming paradigm that uses first-order logic to represent relations between objects and to implement deductive reasoning. The main representative of this paradigm is Prolog.

Initially ILP focused on developing algorithms for the synthesis of logic programs from examples and background knowledge (i.e. knowledge acquisition for some domain). Recent ILP developments have considered classification, regression, clustering, and association analysis [4]. Because of its flexible and expressive way of representing background knowledge and examples, the field has expanded from the single-table case to multiple-table representations.

Here is a small ILP example taken from [9].

Let E+ = {daughter (Mary, Ann), daughter (Eve, Tom)} be the daughter relation's positive examples,

E- = {daughter (Tom, Ann), daughter (Eve, Ann)} be the same relation's negative examples; and

B = {mother(Ann, Mary), mother(Ann, Tom), father(Tom, Eve), father(Tom, Ian), female(Ann), female(Mary), female(Eve), male(Pat), male(Tom), parent(X, Y) ← mother(X, Y), parent(X, Y) ← father(X, Y)} be the background knowledge, where the relations daughter, mother, father, female, male, and parent have their usual meaning. The goal of this ILP example is to learn the daughter concept. For predictive ILP, the solution could be the following synthesized clause:

daughter(X, Y) ← female(X), parent(Y, X).

or a set of definite clauses:

daughter(X, Y) ← female(X), mother(Y, X).

daughter(X, Y) ← female(X), father(Y, X).

Induction is harder than deduction and is the inverse operation of deduction. The deduction process is based on the use of sound rules of inference, but inductive inference involves unsound conjectures based on statistical support from the data [10], which makes it harder to produce well-established ILP engines similar to the deduction engines that exist for logic programming with Prolog.

ILP is the earliest and most developed among these learning approaches. However, its use in relational data mining has been limited due to differences in the input specifications and language bias specifications of different ILP engines [11]. In order to deal with this formalism, to unify the different input specifications of different ILP engines, and to make large ILP engines more accessible, [11] proposed the Unified Modelling Language (UML) as a common specification language. The idea was to replace the logic-based formalism used to specify the language bias with a language that is easy to use, can be used by a range of ILP systems, and can represent the complexity of the problem in a clear and simple way. In this way, even non-expert users are able to model their problems and use a wide range of ILP engines.

Three ways of connecting ILP and relational databases are presented in [13] and are briefly described below:

The simplest approach is to pre-process the relational data into Prolog syntax.

The second approach is to build a link between a Prolog system and the relational database. The relational database is provided alongside the Prolog knowledge base. Each time a literal p(a, b) has to be evaluated, where the predicate p corresponds to a relation P in the relational database, Prolog opens a connection to the database and issues the corresponding query to determine whether the tuple (a, b) is in P.


The third way exploits the close relationship between first-order logic and relational databases (a predicate is a relation between its arguments, which are, in turn, attributes in the relational database) and between logical clauses and relational database queries. Therefore, entire logical clauses (not only one literal at a time) can be translated to SQL statements and submitted to the database to obtain the statistics. Systems implementing this solution already exist; one of them is RDT/DB [13; 12], which couples the ILP system RDT [14] with the Oracle database system.
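As an illustration of this third approach, the daughter clause from the earlier example can be translated into a single SQL statement and executed against the database to count the tuples it covers. The short C# sketch below is only a hypothetical example: the table names Daughter, Female and Parent, the connection string and the use of ADO.NET are assumptions for illustration and are not taken from RDT/DB.

using System;
using System.Data.SqlClient;

class ClauseToSqlDemo
{
    static void Main()
    {
        // daughter(X, Y) <- female(X), parent(Y, X), translated into a query that
        // counts how many stored daughter facts are covered by the body of the clause.
        const string sql = @"
            SELECT COUNT(*)
            FROM Daughter d
            JOIN Female f ON f.Name = d.Child
            JOIN Parent p ON p.Parent = d.Parent AND p.Child = d.Child";

        // Hypothetical connection string; adjust to the local server and database.
        using (var conn = new SqlConnection(@"Server=.\SQLEXPRESS;Database=Family;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            int covered = (int)cmd.ExecuteScalar();
            Console.WriteLine("Covered daughter tuples: " + covered);
        }
    }
}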

There are several different ILP engines, and most of them can be placed within one of two main logical approaches: learning from entailment and learning from interpretations. Explanatory ILP corresponds to the learning-from-entailment approach [5], and most ILP systems learn from entailment, such as RDT [14], Progol [10], FOIL [15], SRT [16] and FORS [17]. Learning from interpretations is called descriptive ILP; examples of ILP engines based on this approach are Claudien, ICL and Tilde [8]. In [5] it was shown how learning from interpretations evolved descriptive techniques into predictive ones. More approaches to ILP can be found in [18].

The two main ILP approaches differ from each other in the representation of examples and background knowledge and in the way the final hypothesis is induced. In learning from interpretations, each example e is represented by a separate Prolog program encoding its specific properties as a set of interpretations, and the background knowledge B is represented in the form of another Prolog program. The entailment paradigm represents all data examples together with the background knowledge as a single Prolog program, with no separation between examples. In the learning-from-interpretations setting, a hypothesis is a set of clauses such that, with the help of the background knowledge, only the positive examples make each clause in the set true. In notation, let E+ and E- be the sets of all positive and negative examples respectively; a hypothesis H has to be found such that:

∀e ∈ E+ : H is true in M(e ∪ B)

∀e ∈ E- : H is false in M(e ∪ B)

where M(e ∪ B) is the minimal Herbrand model of e ∪ B.

In the learning-from-entailment framework, the induction problem is to find H such that the following constraints hold:

∀e ∈ E+ : H ∧ B entails e

∀e ∈ E- : H ∧ B does not entail e

There are two main approaches to the induction process in learning from entailment. One is inverse deduction and the other is a generalization of top-down induction techniques to the first-order case. An example of a system using the inverse deduction technique is Progol [10], which uses inverse entailment, a generalization and enhancement of inverse deduction. An example of the second approach is the FOIL system [15], one of the first ILP systems using top-down induction; most ILP systems that learn from interpretations also use top-down induction of hypotheses. Although it has been shown that learning from interpretations reduces to learning from entailment [18], which in turn reduces to learning from satisfiability (learning from partial interpretations, where missing values are considered), increased interest has been focused on learning from interpretations in several recent ILP systems [5; 8]. Attribute-value representations are an important reason for this special-case setting. Many attribute-value techniques exploit the independence of examples, and the learning-from-interpretations setting also assumes this independence through the separation of information between examples. Because of this, attribute-value learning algorithms can be upgraded in a more straightforward way to the learning-from-interpretations setting than to the learning-from-entailment setting, as shown by the systems Claudien, ICL and Tilde [8]. The learning-from-interpretations setting has also inspired the shift from a purely logical search space to one consisting exclusively of database queries in relational algebra or an SQL-like language [12; 19; 20].

2.3.2 First Order Extension of Bayesian Networks

Probabilistic models, specifically Bayesian Networks (BNs) [21], differ from Inductive Logic Programming approaches by specifying a probability distribution over a fixed set of random variables, which could be the attributes in an attribute-value setting. Thus, Bayesian Networks are a probabilistic extension of propositional logic [22]. The most important feature of these models is their capability for reasoning under uncertainty, but they carry the same limitations as propositional logic when applied to relational data.

In order to handle these limitations, several approaches combining first-order logic and Bayesian Networks have been proposed in the literature, appearing in different fields. The most prominent ones are Probabilistic Logic Programs (PLPs) [23] from logic programming, Relational Bayesian Networks (RBNs) [24] from model theory, and Probabilistic Relational Models (PRMs) [25]. In spite of their different backgrounds, they all seem to share the most essential information represented in Bayesian Logic Programs (BLPs), as shown in [22]; this last approach is a simplification and reformulation of PLPs.

A Bayesian network has two important components, one qualitative and one quantitative. The qualitative part is a directed acyclic graph in which each node represents a domain variable and each arc represents a probabilistic dependency between two variables. The quantitative part is the probabilistic dependency itself, represented by a conditional probability distribution for each node.

A BN represents a probability distribution compactly via conditional independence. A BN can compute the conditional probability of each value of a node given that values have been assigned to the other nodes. Similarly, a BN classifier can compute the conditional probability of the class value given the values of the other attributes.
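As a concrete illustration of such a conditional probability computation, the sketch below works through a two-node network A → B using Bayes' rule; the variable names and probability values are invented purely for illustration and are not taken from the reviewed systems.

using System;

class TinyBayesNetDemo
{
    static void Main()
    {
        // Two-node network A -> B with hypothetical parameters.
        double pA = 0.3;           // P(A = true)
        double pBGivenA = 0.9;     // P(B = true | A = true)
        double pBGivenNotA = 0.2;  // P(B = true | A = false)

        // Marginal P(B = true), summing over the states of A.
        double pB = pBGivenA * pA + pBGivenNotA * (1 - pA);

        // Bayes' rule gives the conditional probability P(A = true | B = true).
        double pAGivenB = pBGivenA * pA / pB;

        Console.WriteLine("P(B=true)          = " + pB);        // 0.41
        Console.WriteLine("P(A=true | B=true) = " + pAGivenB);  // about 0.66
    }
}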

There are several motivations for the use of BNs in relational data mining. Their main advantage is that they can naturally represent relationships among attributes, which is clearly needed in relational domains. In addition, the structural representation of a BN makes the dependencies between attributes easy to understand, and domain experts can easily modify the network to obtain better predictive models.

There is one drawback of BNs: generally, the domain of each random variable has to be finite. Thus, before learning the network and its parameters, a discretisation algorithm must be applied, which can sometimes lose important information.

2.3.3 Multi-Relational Data Mining

The idea of shifting from a pure logical search space to a search space based on database queries was initially proposed in [12]. An implementation of this algorithm was shown in [26]; the algorithm is formulated for relational algebra, although translation to a database query language such as SQL is straightforward.

That work was motivated by the fact that most ILP systems worked with a deductive database such as a Prolog database, rather than a relational database. Therefore, systems that exploit relational database capabilities directly were needed.

A multi-relational data mining (MRDM) framework [19] was proposed later by the same group. This system was based on ILP techniques, and its search space consists of SQL queries.

2.3.4 Other Recent Approaches and Extensions

Another approach transforms a multi-relational database into a single-relation database so that propositional learning algorithms can be applied; it creates new attributes in a central relation that summarize or aggregate information from the other tables in the relational database. Variants of this method have been used in systems such as LINUS [27] and DINUS [28]. These systems are examples of ILP methods that transform restricted ILP problems into attribute-value form via propositionalization and solve the transformed problem with a propositional learning algorithm. Because of some limitations, these two systems cannot tackle relational database problems with non-determinate relationships (where a record in one table can have more than one corresponding record in another table).

In [29], a new transformation-based ILP learner was introduced which can deal with non-determinate relationships and non-constrained literals, and which is oriented towards business domains, where relational databases are often simply structured but large in size. This approach relies on aggregation and has achieved significant results in the business domain [29] and the biology domain [30]. This feature construction method shows that if good attributes are constructed by the propositionalization algorithm, a feature-based learner can outperform a relational learning technique.
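To make the aggregation idea concrete in the SQL-plus-C# setting used by this project, the sketch below flattens a hypothetical one-to-many Customers/Orders pair of tables into a single attribute-value table by adding aggregate columns (order count, average order amount) to the central relation. The table names, column names and connection string are assumptions for illustration and are not taken from [29].

using System;
using System.Data;
using System.Data.SqlClient;

class PropositionalizationDemo
{
    // Aggregates the one-to-many Orders relation into new attributes
    // of the central Customers relation (a simple propositionalization).
    const string Sql = @"
        SELECT c.CustomerId,
               c.Country,
               COUNT(o.OrderId) AS OrderCount,
               AVG(o.Amount)    AS AvgOrderAmount
        FROM Customers c
        LEFT JOIN Orders o ON o.CustomerId = c.CustomerId
        GROUP BY c.CustomerId, c.Country";

    static DataTable BuildFlatTable(string connectionString)
    {
        var flat = new DataTable("CustomerFlat");
        using (var adapter = new SqlDataAdapter(Sql, connectionString))
        {
            adapter.Fill(flat);  // one row per customer, ready for an attribute-value learner
        }
        return flat;
    }

    static void Main()
    {
        DataTable flat = BuildFlatTable(@"Server=.\SQLEXPRESS;Database=Shop;Integrated Security=true");
        Console.WriteLine("Flattened rows: " + flat.Rows.Count);
    }
}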

Continuing the same trend, [31] proposed the use of a simple propositional Bayesian classifier in an iterative way for relational domains. This approach keeps the relational structure, and the objects are "flattened" only when required by the algorithm. Inferences made about an object can assist inferences about related objects; therefore, inferences made with high confidence about some objects are fed back into the database and used for subsequent inferences about possibly related objects. Although the simple Bayesian classifier is an attribute-value learner, keeping the relational structure of the data makes it possible to feed results back to related objects.

Structural data mining approaches based on graphs have been proposed as well [32; 33]. In this work, objects in the data correspond to vertices in the graph, and relationships between objects correspond to directed or undirected edges. The Subdue system looks for patterns embedded in graphs. Once a pattern (substructure) is found, it is added to the graph by replacing instances of the substructure with the substructure itself.

2.4 Conclusion

In this chapter we explored most of the data mining techniques, especially multi-relational data mining and its techniques, in detail. Most approaches use ILP with a Prolog database, and this chapter also discussed several other approaches, along with the advantages and disadvantages of each.

CHAPTER 3: METHODOLOGIES

___________________________________________________________________

3.1 Summary

This chapter discusses development methodologies, covering several models, and then reviews the technologies used, with their advantages and disadvantages.

3.2 Methodologies

Software development can be decomposed into a small number of activities [38, page 25-37]. There are a number of methods which can help to complete a project, following a few basic steps: analysing the requirements, designing the system, implementation and finally testing [39]. These models help to organize a project efficiently. Each model has different pros and cons, and normally a model is selected according to the user's requirements. Some models are explained below:

3.2.1 Waterfall Model

This model is sometimes referred to as the linear-sequential life cycle model and is the most common classic model among the life cycle models. It is very simple to design, manage and use. Each phase proceeds in sequence, one after another; the basic phases are analysis, design, code and testing. The drawback of this model is that there is no looking back [38, page 25-37]: it cannot accommodate any change once a step has been completed. This model is suitable for small projects, because in bigger projects the developer needs to go back several times to achieve the user's requirements. It is also unable to deal with unforeseen problems arising during the project, because it has no risk management.

3.2.2 V Shaped Model

"Just like the waterfall model, the V-Shaped life cycle is a sequential path of execution of processes. Each phase must be completed before the next phase begins. Testing is emphasized in this model more so than the waterfall model though" [39]. In V-Shaped model each phase has have to delivered test results at each step. The results which make the development inflexible for any unseen problem or intentional problem in the project should identified at early stage. The difference between this model and waterfall model is early test plan but still without risk management [40].

3.2.3 Incremental Model

"Incremental model is an evolution of waterfall model. The product is designed, implemented, integrated and tested as series of incremental builds" [41]. This Model combines the elements of waterfall model with iterative philosophy of prototyping i.e. a general objective definition for the project [38, page 25-37]. This model can be term as "multi-waterfall cycle" if multiple development cycles take place in this model [39]. This model suggests for all kind of change at each stage. After any stage completed then backing up to the previous stage is not possible unless change or tested can occur during stage. This model does not offer fully flexibility but it offers some extent. After completion the project no feedback required which may subject the project to real world data without any testing and end up with project failure.

3.2.4 Spiral Model

The Spiral model couples the systematic aspects of the waterfall model with the iterative nature of prototyping in a controlled way. A project is developed in a series of incremental releases [38, page 25-37]. "The spiral development model is a risk-driven process model generator that is used to guide multi-stakeholder concurrent engineering of software-intensive systems" [42]. The Spiral model is similar to the incremental model but with more emphasis placed on risk analysis. Planning, risk analysis, engineering and evaluation are the four phases of this model. Before reaching completion, a project goes through several iterations of these basic phases [39]. Each phase is repeatedly revisited in different iterations, so the model is suitable for big projects. As shown in figure 3.1, a project starts with the requirements phase, followed by risk assessment and so on. With the help of risk management techniques it effectively analyses all unforeseen problems and changes.

Figure 3.1 - Spiral Model [42]

The Spiral model revolves around three to six task regions. Task regions are small framework activities that are repeated in the iterations [38, page 25-37] and are explained below:

Customer Communication

Customer interaction is involved at each phase in this activity, and it builds a relationship between the customer and the developer. This activity is helpful in minimizing errors in an application and hence provides more room for fault tolerance [38, page 25-37].

Planning

The planning section helps to define all required tasks and to assign resources, timelines, milestones and necessities [38, page 25-37].

Risk Analysis

This phase evaluates both the managerial and technical risks of the project; it also helps to cover all unforeseen problems that are not included or visible in the project plan [38, page 25-37].

Engineering

This region includes all tasks required to build one or more representations of the application [38, page 25-37].

Construction and Release

This phase includes tasks that are helpful in implementation, such as construction, testing, installation and user support [38, page 25-37].

Customer Evaluation

After the completion and release of each version, the software is tested to obtain the customer's feedback, and changes are made if required [38, page 25-37].

3.2.5 Methodology Selection and Implementation

The Spiral model is used for this prototype development as it provides a high level of accuracy and risk management. Although it requires a high level of expertise and costs more than other models, the benefits it brings cannot be compromised.

According to the spiral model implementation, as a first step all of the requirements are analysed and the system design is built. The software is developed in three iterations. In the first iteration two operations are performed: displaying all databases in the selected data source, and selecting a database and displaying the tables from it. In the second iteration a table is selected to build a prototype data mining model; this iteration is explained in chapter 4. In the last (third) iteration the data model is simplified by removing unnecessary attributes and exported in the required format.

3.3 Technology Review

All of the technologies used in this project are Microsoft based. Microsoft has grown steadily and become one of the most widely used software companies in the world (especially among ordinary users). On the other hand, the majority of developers use Microsoft's tools because they are easy to use and learn. Microsoft has been successful in capturing much of the market even in the presence of giants like Oracle, Adobe and UNIX. Recently Microsoft launched new versions of Microsoft SQL Server and Microsoft Visual Studio with upgraded tools like VB.NET, C#, C++ and ASP.NET. These tools are sufficient to fulfil the user's requirements, and they come with friendly user interfaces and usability features.

3.3.1 Visual Studio 2008

This software is developed in Microsoft Visual Studio 2008 because it provides all the tools needed to complete it, such as C# and Windows Forms. Visual Studio's first version was released in 1997, but it was not capable of coping with much competition. After this version Microsoft released other versions, such as 2005, 2008 and 2010, which are very effective in the development market. In Visual Studio the user can design both desktop and web-based applications, and it also supports other languages and technologies such as Visual C/C++, Visual J#, C#, F#, XML, HTML, JavaScript, CSS, ASP.NET and VB.NET. Visual Studio is an Integrated Development Environment (IDE); it provides tools for software design, development, testing, quality assurance, and debugging [34, page 3].

3.3.2 C#

C# is a combination of the C/C++ and Java languages with the functionality of Visual Basic. Its development style is like Java, i.e. strong web integration and automatic memory management [34, page 643]. Compared to C++ or Java, C# utilizes the .NET Framework technology in the most logical way. C# is a powerful modern language for development, and it gives C/C++ programmers a better and easier programming platform [36]. C# is an event-driven, fully object-oriented, visual programming language [37, page 9]. In comparison with Visual Basic, Java and other development tools available in the market, C# was chosen because of key advantages that are required for this software; some of these advantages are listed below:

C# provides a powerful forms engine, just like Visual Basic [37, page 10].

In C/C++ it is easy to access any chunk of data by defining some elements and reading or accessing the data in memory; C# instead provides type safety, and if someone wants to bypass the type safety this can be done simply by applying the keyword unsafe to the desired block of code to skip the .NET type security [34, page 691].

C# provides a pure object-oriented approach and treats even simple data types as objects; e.g. the simple data type int can be treated and retrieved as a string by using the ToString method (see the sketch after this list).

C# provides an operator overloading facility like C/C++.

C# and VB.NET can express the same code in their programs to compile and run, but the trial results in [35] reveal that C# is slightly faster to compile and run than VB.NET.

C# has also been released to ECMA (the European Computer Manufacturers Association) and has been adopted as a published standard.
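The following short C# sketch illustrates two of the points above: a simple value type used as an object via ToString, and operator overloading. The Money struct is a hypothetical example written for this list and is not part of the project's code.

using System;

struct Money
{
    public decimal Amount;
    public Money(decimal amount) { Amount = amount; }

    // Operator overloading, as in C/C++.
    public static Money operator +(Money a, Money b)
    {
        return new Money(a.Amount + b.Amount);
    }

    public override string ToString() { return Amount.ToString("0.00"); }
}

class LanguageFeaturesDemo
{
    static void Main()
    {
        int count = 42;
        string asText = count.ToString();  // a simple data type treated as an object

        Money total = new Money(9.99m) + new Money(5.01m);
        Console.WriteLine(asText + " items, total " + total);  // prints "42 items, total 15.00"
    }
}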

3.4 Conclusion

This project develops prototype software which facilitates building a model for data mining from multi-relational data. The project is developed in C#, which is available in Visual Studio 2008, and uses the spiral model for development, with three iterations in which risks and objectives are identified.

CHAPTER 4: DESIGN AND IMPLEMENTATION

___________________________________________________________________

4.1 Summary

This chapter discusses different design architectures, how they are implemented, and why they suit this application. Finally, it transforms the conceptual design into a physical, real application.

4.2 Architecture

This prototype application is built upon a three-layered architecture and object-oriented design. "Creating a multi-tier project designs is more suitable and advisable for mid to large-size projects, whereas a good multi-layered design is suitable for small to mid-size projects" [43]. A three-layered, object-oriented approach not only makes the project scalable but also gives an opportunity to reuse the code in the future. The application's architecture is shown graphically in figure 4.1.

Figure 4.1 - Application Architecture (Presentation Layer → Business Logic Layer → Data Access Layer → Data)

4.2.1 Presentation Layer

The presentation layer forms the layout of the application. The application contains one form with several controls. This form and its controls enable users to interact with the application, either by taking input or by giving output. The presentation layer has the following elements:

One Windows form with two tabs, i.e. a step 1 tab for the process and a second tab for the data view.

Four data source type selectors, i.e. SQL Database, Microsoft Access Database, Oracle Database and ODBC.

One data grid view.

4.2.2 Business Logic Layer

The business logic of an application must be fulfilled according to the user's needs. Since it varies from user to user, it is necessary to keep it separate to make the application flexible. Frequently repeated operations are programmed as functions in classes for reuse. The business logic implemented in this application is as follows:

The user can select a data source type and get a list of all databases.

The user selects a database, connects to it with a username and password, and gets the list of all its tables.

From the table list the user can select a target table and get all related tables, with their data shown in the data grid view on the form's second tab.

The user can remove unnecessary columns from the data grid view.

From the data grid view the user can export the data into the required format.

Only the SQL Server database type is implemented in this application at present.

4.2.3 Data Access Layer

The Data Access Layer provides the application with access to the database. It enables the developer to use simple methods for accessing the data rather than writing the complex syntax every time. This application uses a standard class, DataProcess.cs, with many methods.

The data resides in the DBMS, which stores the actual data in tables; these tables can be accessed directly. The application's code is reusable and scalable, as data access and business logic are implemented through classes. At the moment the application is implemented only for SQL Server.
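The dissertation does not list the individual methods of DataProcess.cs, so the outline below is only an assumed sketch of what such a data access class might look like; the method names GetDatabases and GetTables are hypothetical, although the two queries they wrap are Statements No. 1 and No. 2 from sections 4.3.1 and 4.3.2.

using System.Data;
using System.Data.SqlClient;

// Assumed outline of a data access class similar in spirit to DataProcess.cs.
public class DataProcess
{
    private readonly string _connectionString;

    public DataProcess(string connectionString)
    {
        _connectionString = connectionString;
    }

    // Runs a SELECT statement and returns the result as a DataTable.
    public DataTable RunQuery(string sql)
    {
        var table = new DataTable();
        using (var adapter = new SqlDataAdapter(sql, _connectionString))
        {
            adapter.Fill(table);
        }
        return table;
    }

    // Wraps Statement No. 1 (user databases only, see section 4.3.1).
    public DataTable GetDatabases()
    {
        return RunQuery("SELECT name FROM sys.databases " +
                        "WHERE name NOT IN ('master', 'tempdb', 'model', 'msdb') ORDER BY name");
    }

    // Wraps Statement No. 2 (user-defined tables only, see section 4.3.2).
    public DataTable GetTables()
    {
        return RunQuery("SELECT TABLE_NAME AS T_Names FROM INFORMATION_SCHEMA.Tables " +
                        "WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_NAME != 'sysdiagrams' ORDER BY T_Names");
    }
}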

4.3 Application Implementation

This section shows the implementation of the application, which is based on the following steps.

4.3.1 Step 1: Get Databases List

After the user selects the data source type, the application uses SQL query Statement No. 1 to get the database list. The list excludes master, model, msdb and other system-defined databases.

Statement No. 1:

SELECT name FROM sys.databases WHERE name NOT IN ('master', 'tempdb', 'model', 'msdb') ORDER BY name
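A minimal sketch of how the result of Statement No. 1 could be read and bound to the database selector on the form is given below; the control name cboDatabases and the connection string are assumptions, not names taken from the project's code.

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Windows.Forms;

static class DatabaseListLoader
{
    // Executes Statement No. 1 and binds the returned database names to a combo box.
    public static void LoadDatabases(ComboBox cboDatabases, string connectionString)
    {
        var names = new List<string>();
        const string sql = "SELECT name FROM sys.databases " +
                           "WHERE name NOT IN ('master', 'tempdb', 'model', 'msdb') ORDER BY name";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    names.Add(reader.GetString(0));
                }
            }
        }

        cboDatabases.DataSource = names;
    }
}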

4.3.2 Step 2: Get Tables List

After the user selects the target database, the application uses SQL query Statement No. 2 to get all the tables in the selected database. This query returns all user-defined tables in the database.

Statement No. 2:

SELECT TABLE_NAME AS T_Names FROM INFORMATION_SCHEMA.Tables WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_NAME != 'sysdiagrams' ORDER BY T_Names

4.3.3 Step 3: Get Target Table's Related Table List

In this step the user selects the target table to build the model for data mining. After selection of the target table, the system retrieves its related tables using Statement No. 3.1 and Statement No. 3.2.

Statement No. 3.1

SELECT C_Tbl.name AS C_Tbls FROM sysforeignkeys fk

join sysobjects C_Tbl on C_Tbl.id = fk.fkeyid

join sysobjects P_Tbl on P_Tbl.id = fk.rkeyid

WHERE P_Tbl.name = 'Target Table Name'

Statement No. 3.2

SELECT C_Tbl.name AS C_Tbls FROM sysforeignkeys fk

join sysobjects P_Tbl on P_Tbl.id = fk.fkeyid

join sysobjects C_Tbl on C_Tbl.id = fk.rkeyid

WHERE P_Tbl.name = 'Target Table Name'

After this process, this step has a few sub-steps:

Build a statement to get the data from all tables.

Create a temporary table and copy the data into it.

Populate that temporary table into the data grid view.

This sub-step uses several SQL statements to verify columns and to drop all foreign and primary keys for the temporary model table; a minimal sketch of these sub-steps is shown below.
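The sketch below illustrates these sub-steps for the Pubs example used in chapter 5; the temporary table name #ModelTemp, the choice of joined columns and the connection string are invented for illustration, since the actual statements are generated by the application.

using System;
using System.Data;
using System.Data.SqlClient;

class BuildModelDemo
{
    static void Main()
    {
        // Join the target table with one of its related tables, copy the result into a
        // temporary table, then load that table into a DataTable for the data grid view.
        const string buildTemp = @"
            SELECT p.*, t.title, t.type, t.price
            INTO #ModelTemp
            FROM publishers p
            JOIN titles t ON t.pub_id = p.pub_id";

        const string readTemp = "SELECT * FROM #ModelTemp";

        var model = new DataTable("Model");
        using (var conn = new SqlConnection(@"Server=.\SQLEXPRESS;Database=pubs;Integrated Security=true"))
        {
            conn.Open();
            using (var cmd = new SqlCommand(buildTemp, conn))
            {
                cmd.ExecuteNonQuery();  // create and fill the temporary table
            }
            using (var adapter = new SqlDataAdapter(readTemp, conn))
            {
                adapter.Fill(model);    // populate the DataTable shown in the data grid view
            }
        }
        Console.WriteLine("Model rows: " + model.Rows.Count);
    }
}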

4.3.4 Step 4: Remove extra column and export data

In this step the user can remove unnecessary columns in the data grid view and, with one click, export the data to the required format. It is then ready for data mining.

In the implementation, all SQL queries are executed with the help of C# classes. SQL queries are mostly used only to get information from the databases; all other processing is implemented in C#.
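The export code itself is not listed in the dissertation, so the following is only an assumed sketch of how the DataTable behind the grid view could be written out as a CSV file; the class and method names are hypothetical.

using System.Data;
using System.IO;
using System.Linq;

static class CsvExporter
{
    // Writes the DataTable behind the data grid view to a CSV file.
    public static void Export(DataTable table, string path)
    {
        using (var writer = new StreamWriter(path))
        {
            // Header row built from the remaining (not removed) columns.
            writer.WriteLine(string.Join(",", table.Columns
                .Cast<DataColumn>().Select(c => c.ColumnName).ToArray()));

            // One line per data row; quoting of values that contain commas is
            // omitted here to keep the sketch short.
            foreach (DataRow row in table.Rows)
            {
                writer.WriteLine(string.Join(",", row.ItemArray
                    .Select(v => v == null ? "" : v.ToString()).ToArray()));
            }
        }
    }
}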

4.4 User - Interface Design

User interaction depends entirely on the application interface, because the user's input and output depend on it. A clear, meaningful and effective interface creates good communication between an application and an end user. [38, page 394] mentions that user interface design can be based on three golden rules: place the user in control, minimize the user's memory load, and make the interface consistent. This prototype application tries to meet the user interface quality principles, i.e. structure, simplicity, visibility, feedback, tolerance and reuse, which are defined in [44].

4.4.1 Structure

The structure should be clear, user friendly and consistent, and it should be organized meaningfully and purposefully. This application uses a simple Windows form with some grouped controls, which makes it understandable for users. The form design is shown in figure 4.2.

Figure 4.2 - application's main form

The first screen shows step 1: select the database type. In this application only the SQL database type is implemented; on selection of the SQL database type, the step 2 controls appear as shown in figure 4.3.

Figure 4.3 - application's main form

In step 2 the user needs to select the database and provide a username and password. On success of the second step, the target table selector of step 3 appears as shown in figure 4.4.

Figure 4.4 - application's main form

In step 3 the user has to select the target table to build the data model. If this operation succeeds after selecting the target table, the build model tab appears as shown in figure 4.5.

Figure 4.5 - application's main form after last step

In the last step the user can remove unnecessary columns with a click on the column header, as shown in figure 4.6, and then export the data in the required format by clicking the buttons.

Figure 4.6 - model alteration

In the final phase, the user just clicks the button to export the data to the required format and receives a message with the file name if the operation is successful, as shown in figure 4.7.

Figure 4.7 - after final operation

4.4.2 Simplicity

A simple interface makes the information more readable and useful. This application uses a simple interface without any confusing controls or colours, and most of its operations are single clicks or selections.

4.4.3 Visibility

Clear visibility is an important aspect of a good application; it helps in communicating useful information to the end user. Good visibility includes the use of a simple colour scheme, clear information, and clear input and output display.

4.4.4 Feedback

Good feedback is very important for a successful application. Users must be kept informed at all times, because this creates understanding between the user and the application, and feedback is very useful for this purpose. Some examples of feedback in this application are:

"Database type is not available" message in figure 4.8.

"Database is not selected" or connection error message in figure 4.9.

Confirmation message for removing unnecessary attributes in figure 4.6.

Successful data export message in figure 4.7.

Figure 4.8 - error message

Figure 4.9 - error message example

4.4.5 Tolerance

A good application must have the ability to handle user and application errors. This application does not allow undo, but it is able to tolerate and recover from errors; for example, if the database type is not available it asks the user to select another type, as shown in figure 4.8, and similarly for the error in figure 4.9; both allow the user to try again. The application will not crash.
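As an assumed illustration of this tolerance, a database connection attempt could be wrapped as follows, reporting the problem to the user (as in figure 4.9) instead of crashing and letting the user try again; the class name and message text are invented for this sketch.

using System.Data.SqlClient;
using System.Windows.Forms;

static class ConnectionGuard
{
    // Tries to open a connection; on failure shows a message and lets the user try again.
    public static bool TryConnect(string connectionString)
    {
        try
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                return true;
            }
        }
        catch (SqlException ex)
        {
            MessageBox.Show("Could not connect to the selected database: " + ex.Message +
                            "\nPlease check the username and password and try again.",
                            "Connection error", MessageBoxButtons.OK, MessageBoxIcon.Error);
            return false;
        }
    }
}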

4.4.6 Reuse

Reusability is a useful tool for developers because it saves time and cost. This application uses different components and classes which provide facilities to check data and formats and which return different types of objects, such as data tables, datasets, lists, array lists, etc.

4.5 Conclusion

This chapter gave an overview of different application architectures and how they are implemented in this application. After this stage the data model is ready for testing.

CHAPTER 5: TESTING AND EVALUATION

___________________________________________________________________

5.1 Summary

This chapter discusses the testing and evaluation of the developed application. This is based on two steps: first a data model is built, and then that model is tested in Weka.

5.2 Build a Data Model

In this phase of building the data model, the system uses the Pubs database, a sample database for SQL Server. The machine used in the model-building process has the following configuration:

Processor type: Intel(R) Core(TM) 2.13GHz

Memory: 3.00 GB

Operating system: Windows 7 Professional

Data Server: Microsoft SQL Server 2008

The Pubs sample database [45] is a multi-relational database with 12 tables. Figure 5.1 shows its ER diagram.

Figure 5.1 - Pubs ER Diagram.

To build the data model, all steps from chapter 4 are followed, and the Publishers table is selected to build a model, as shown in figure 4.4. On selection of the target table Publishers, the system finds all the tables related to the target table; first the system builds a list of all related tables, which looks roughly as shown in figure 5.2, and after this step the system converts all the related tables into a single temporary table, as shown in figure 4.5.

Figure 5.2 - related tables ER diagram

After all the data is gathered into one table, as in figure 4.5, some unnecessary columns are removed and the data is exported to CSV (comma-separated values) format; this step creates a file which is used in the next step.

5.3 Data Model Test with Weka

In this section we carry out two tests with different databases.

5.3.1 Test - 1

This phase uses Weka 3.4.6 [46], which is free data mining software. The first step is to load the data model into Weka and remove any empty attribute, as shown by the circled attribute in figure 5.3.

Figure 5.3 - remove empty attribute

After removing the empty attribute, we apply the J48 classifier tree (an implementation of the C4.5 algorithm) with 10-fold cross-validation as the evaluation approach. As a result of this process we obtain the model shown in figure 5.4, with 77% model accuracy. Figure 5.5 is a tree view of this model.

Figure 5.4 - model results

Figure 5.5 - model tree view

5.3.2 Test - 2

In this test we follow all the steps above but with a different database, the Northwind Traders database [45]. It is also a multi-relational database, with 13 tables. After the data model is created, we apply the Weka J48 classifier and obtain the results shown in figure 5.6. This model's accuracy is 99.94%, and the model's tree view is shown in figure 5.7.

Figure 5.6 - test 2 model results

Figure 5.7 - test 2 model's tree view

5.4 Conclusion

In this chapter, during testing and evaluation, we used two databases to build data models and then used Weka to see the models' results.

CHAPTER 6: CONCLUSION & FUTURE WORK

___________________________________________________________________

6.1 Summary

This chapter summarizes and reflects on the work carried out in the development of the prototype application. Lastly, suggestions for future work are noted.

6.2 Conclusion

The development of the prototype application started with elementary research. This included a literature review with the aim of capturing background information on the work performed on this topic in past years. The review gave an overview of multi-relational data mining and an analysis of its different approaches, discussing different pre-processing techniques and current research. The background research showed that many multi-relational data mining techniques were implemented with ILP and a Java front end. This was the motivation to design something different with different technology.

Based on this motivation, the prototype application design was planned and implemented. The spiral methodology was adopted to develop the application. A number of technologies were analysed and selected: the application was built in Microsoft Visual Studio 2008, uses SQL queries, and its coding was done in C#.

In the design and implementation chapter, the design phase illustrates how the tool produces the desired functionality and outlines the system infrastructure and architecture, while the implementation phase deals with the realisation of the design.

Through the testing process, data models were built and results were produced by experimental tests on those models in WEKA. During evaluation these results were found acceptable.

6.3 Future Work

This prototype software is currently implemented only for SQL Server, but it could be extended to other database types such as Oracle, Microsoft Access, etc. The software also gives users the facility to remove unnecessary columns before exporting the data to the required format; this could instead be offered immediately after the target table's related tables are found, giving a visual facility to remove unnecessary tables as well as columns.