# Background and literature review of pre processing

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The chapter discusses the general concepts pre-processing techniques for multi relational data mining, including data mining and research about multi relational data mining and different kinds of techniques, are discussed in this chapter.

2.2 Overview of Data Mining

Data mining is not only collecting and managing data. Data mining is involved to extracting the hidden predictive information/knowledge from large databases [1]. Data mining can be performed on different kinds of data such as quantitative, textual, multimedia forms and etc. This powerful technology can help companies to focus on the very useful and important information in their data warehouses which can help them to predict future trends and behaviours of businesses.

The objective of data mining research is to develop more efficient solutions for to discover an interesting knowledge and previously unknown knowledge from larger databases [2; 3].

As [1; 4] mentioned is the central and essential process in Knowledge Discovery in Databases (KDD) and [5] mentioned KDD process consists 3 subtasks. In the first task fit the input format of data mining algorithm of original data format (pre-processing the data). After data formation, for the extraction of interesting patterns, regularities or general laws from data, one or more algorithms could be applied and this phase called proper data mining. The last phase called post-processing of results in which may be obtained results by data mining process need to translate in to more intelligent format.

There are two wide categories in which we can classify the data mining algorithms predictive or descriptive. Predictive data mining the process which is use for automatically creating a classification model from given examples,, the most popular example of predictive techniques is decision tree induction (e.g., Quinlans C5) [5]. On the other side, descriptive data mining technique is to characterize data by the help of finding patterns and regularities in the given examples. Descriptive data mining could be classified into clustering, association and sequential analysis [7]. Discovering of association rules is its example mentioned in [8].

2.3 Multi Relational Data Mining Approaches

In this section we will discuss some approaches which have been purposed in the past.

2.3.1 Inductive Logic Programming

Inductive Logic Programming (ILP) is situated at the point where two computer science areas are meeting. One is Induction, this is the main technique which used by many Machine Learning algorithms to create models that generalize beyond specific instances and the second part is Logic Programming, this is the programming paradigm which helps to first order logic for the demonstration of relations between objects and implementing deduction reasoning. Prolog is the main representative of the paradigm.

At early stages of ILP focused on developing algorithms for the creation of logic programs by the help of examples and background knowledge (i.e. knowledge acquisition for some domain). Recent ILP developments have considered classification, regression, clustering, and association analysis [4]. It is very flexible and expressive way of representation of background knowledge and examples, the field has been expanded from single table case to multiple table representation.

Here is small ILP example taken from [9].

Let E+ = {daughter (Mary, Ann), daughter (Eve, Tom)} be the daughter relations positive examples,

E- = {daughter (Tom, Ann), daughter (Eve, Ann)} be the same relations negative examples; and

B = {mother(Ann, Mary), mother(Ann, Tom), father(Tom, Eve), father(Tom, Ian), female(Ann), female(Mary), female(Eve), male(Pat), male(Tom), parent(X,Y)

? mother(X, Y), parent(X, Y) ? father(X, Y)} be the background knowledge, where the relations daughter, mother, father, female, male, and parent have the common meaning. Then the goal of this ILP example is to learn daughter concept. For the predictive ILP solution could the following synthesized clause:

daughter(X, Y) ? female(X), parent(Y, X).

or a set of definite clause:

daughter(X, Y) ? female(X), mother(Y, X).

daughter(X, Y) ? female(X), father(Y, X).

The induction process is harder than deduction and inverse process of deduction. Basically deduction process is based on the use of sound rules of inference but inductive inference involves unsound conjunction based on statistical support from data [10], this process makes very challenging to create well recognised ILP engines same as those are present in inductive logic programming with Prolog.

ILP has been first and most expanded approach among the other learning approaches. However some differences between input specification and language bias specifications of different ILP engines reduce its use in relational data mining approaches [11]. Unified Modelling Language (UML) has been proposed as common specification language in [11] to deal with formalism and to unify the different input specifications for different ILP engines and for the large ILP engines. The idea was to replace the logic-based formalism to specify the language notation with a language that can be simple useable, can be used by quite a lot of of ILP systems, and can be denote the complexity of the problem in clear and simple way. This way makes very easy and understandable, even the non-expert will be comfortable to model their problems and use a number of range of ILP engines.

There are three ways to connection between ILP and relational databases, which are discussed in [13] and briefly described below:

a) Pre-process the relational database into Prolog syntax is the first approach.

b) The second approach is to build link between a Prolog Systems and relational databases. A relational database is given with the Prolog knowledge base. Each time a literal p (a, b) has to be evaluated, where the predicate p corresponds to a relation P in the relational database, Prolog has to open a connection with the database and make corresponding query to determine whether the tuple (a, b) is in P.

c) The third way exploits the close interactions among the first order logic and relational databases (a predicate is relation between its arguments which are, in turn, attributes in relational databases) and between logical clauses and relational database queries. Therefore, entire logical clauses (not only a literal at a time) can be transformed to SQL queries and submitted to the database to get the statistics. These systems implemented solutions already exist. One of the systems is RDT/DB [13; 12], which couples the ILP system RDT [14] with the Oracle database system.

There are several different ILP engines existing and most of them can be sited within one of two main logical approaches: learning from entailment and learning from interpretations. Explanatory ILP which can get from entailment approach [5] and most of ILP systems are learning from entailment such as RDT [14], Prolog [10], FOIL [15], SRT [16] and FORS [17]. Learning from interpretations is called descriptive ILP. Examples of the ILP engines which are based on this approach are Claudien, ICL and Tilde [8]. In [5] was showing how learning from interpretations evolved descriptive techniques to predictive one. There more approaches to ILP can be found in [18].

ILP's two main approaches are different from each other in the way of representation of examples, background knowledge and the way final hypothesis induced. In the learning by interpretations each example e is denoted by a separate Prolog program encoding its specific properties as sets of interpretations and the B is represented background knowledge in the form of different Prolog program. Entailment paradigms learning represents all data examples together with background knowledge as a simple Prolog program and there are not separations between examples. In learning from interpretations setting, a hypothesis is a set of clauses because of that only each positive example with the help of background knowledge makes each clause in the set true. In notation, let E+ and E- be the sets are for the all positive and negative examples respectively and H hypostasis has to found such that:

e E+ : H is true in M(e B)

e E- : H is true in M(e B)

Where M (e B) is minimal Herband model of e B.

Learning from entailment basis, to find H is the induction problem such that the following constraint holds

e E+ : H H B entails e

e E : H H B does not entails e

There are two main approaches to induction process in the learning from entailment. One is invers deduction and other one is a generalization of top-down induction techniques to the first order case. In [10] mentioned Prolog is an example of system which is using inverse deduction technique. This example uses inverse entailment technique which is a generalization and enrichment of the inverse deduction. Example to second approach is FOIL system [15] which is the first ILP systems using top-down induction. Most of interpretation learner ILP systems uses top-down induction hypothesis. Although it has been shown learning form interpretations reduces learning from entailment [18], which also reduces to learning from satisfiability (learning from partial interpretations, where missing values are considered), an increased interest has been concentrated on learning from interpretations in numerous recent ILP systems.[5, 8]. The attribute value representations are the important reason for this special case setting. But many attribute value techniques exploit the independence of examples. The learning from interpretations setting also assumes this freedom through the separation of information between examples. Because of that attribute value learning algorithms can be upgraded in a trivial way to the learning from interpretations setting than to the learning from entailment and this has been presented by the systems Claudien, ICL and Tilde [8]. The learning by interpretations setting has also inspired the shifting from pure logical search space to one consisting exclusively of database statements in relational algebra or SQL like language in [12; 19; 20].

2.3.2 First Order Extension of Bayesian Networks

Probabilistic Models specifically Bayesian Networks (BNs) [21] different from the Inductive Logic Programming approaches by identifying a probability distribution over a fixed set of random variables that could be the attributes in an attribute-value setting. Thus, Bayesian Networks are a probabilistic extension of propositional logic [22]. The most significant features of these models are their capability of reasoning under uncertainty. But these models carry same limitations as propositional logic when they applied to relational data.

In order to handle these limitations, several approaches have been proposed to combine first order logic and Bayesian Networks in the literature. Different techniques for this problem have been appeared in different fields. The most prominent ones are Probabilistic Logic Program (PLPs) [23] from logic programming; Relational Bayesian Network (RBNs) [24] from model theory, and Probabilistic Relational Models (PRMs) [25]. In spite of their different backgrounds, they all seem to share most essential information represented in Bayesian Logic Programs (BLPs) as shown in [22]. The last approach is simplification and reformulation of PLPs.

Qualitative and quantitative are the two important components of Bayesian networks. The directed acyclic graph (DAG) is showing representation of the qualitative, where each node denotes a domain variable and each arc represents probabilistic dependency between two variables. The probabilistic dependency is the quantitative portion which represented through a conditional probability distribution for each node.

Probability distribution via conditional independence can be represented compactly in BN. A BN can calculate the conditional probability for each values node given that values have been assigned to the other nodes. Similarly BN classifier can compute conditional probability of the class value given the values of the other attributes.

There are several motivations for the use of BNs for relational data mining. The BNs main advantage is that they can represent relationships naturally among attributes and this is clearly needed in relational domains. Also BNs structural representation technique make easy to understand between attributes and domain experts could modify easily to obtain better predictive models.

There is one drawback of BNs; generally the domain of each random variable has to be finite. Thus, before learning the network and its parameters must be used some of discretisation algorithms, because of then sometime results lost important information.

2.3.3 Multi-Relational Data Mining

In [12] initially proposed an idea of moving from a pure logical search space to a search space based on database queries. In [26] was shown implementation of this algorithm and that algorithm is for relational algebra, although database query language such as SQL is straightforward.

That work was motivated because most of the ILP approaches were with a deductive database such as Prolog database, rather than relational database. Therefore, systems that exploit relational database capabilities directly were needed.

A multi-relational data mining (MRDM) framework (19) was proposed later by same group. This system was based on ILP techniques and search space consists of SQL queries.

2.3.4 Other Recent Approaches and Extensions

The way of transforming a multiple relations database into a single relation database with help of propositional learning algorithms and also it create new attributes in a central relation that summarizes or aggregate data from other tables in the relation database. This methods variants have been used in systems such as LINUS [27] and DINUS [28]. These systems are examples of ILP methods which transform restricted ILP problems into attribute-value form by propositionalization technique and solve the transformation problem with help of propositional learning algorithm. Because of some limitations these two systems cannot tackle relational database problems such as non-determinate relationships (a record in one table can have more than one corresponding records in another table).

In [29], has been introduced a new ILP learner which was transformation-based and was able to deal with non-determinate relationship, non-constrained literals and it was business domain oriented because of simple and large size relational databases. This approach is known as Rely on Aggregation and achieved significant results from business domain [29] and biology domain [30]. This feature construction process shows that if good attributes are created by the propositionalization algorithm, a feature-based learner can outperform relational learning technique.

Continuing with same trend, [31] proposed the usage of a simple propositional Bayesian classifier in an iterative way for relational domains. This approach keeps the relational structure and the objects are flattened to fulfil the algorithms requirement. Suggestions made about an object can assist suggestions about related objects; therefore, suggestions made with high confidence for some objects are fed back into the database and used for posterior suggestions about possible related objects. Although simple Bayesian classifier is an attribute-value leaner, keeping the relational structure of the data helps to perform the feedback procedure into related objects.

Structure data mining approaches have been proposed as well in the form of graph in [32; 33]. In this work there is corresponding between objects in the data and vertices in the graph, also relationships between objects correspond to focused or unfocused edges in the graph. The Subdue looks patterns embedded in graphs. Once a design (substructure) found, it is add to the graph by replacing instances of the substructure with the substructure itself.

2.4 Conclusion

In this chapter explore most of the data mining techniques especially multi-relation data mining and their all techniques in details. Most of approaches are by ILP with Prolog database and this section discussed some other approaches as well all. This chapter also discussed all approaches advantages and disadvantages.