# The Analysis Of The Geospatial Data Biology Essay


Geospatial analysis is an approach that applies statistical analysis and other informational techniques to data with a geographical or geospatial aspect. The spatial dimension of such data increases the complexity of data mining tasks. In this paper we build an associative classifier on top of SPADA. There are two main solutions for associative classification: propositional and structural methods. The propositional approach uses spatial association rules to construct an attribute-value representation (propositionalisation) of spatial data and performs spatial classification with well-known propositional classification methods. Since the attribute-value representation should capture relational properties of spatial data, multi-relational association rules are used in the propositionalisation step. The structural approach resorts to an extension of naïve Bayes classifiers to multi-relational data, where the classification is driven by multi-relational association rules that model regularities in spatial data. Both approaches are investigated in the context of the associative classification framework, which combines spatial association rule discovery and classification by employing association rules for classification purposes.

Keywords: Spatial classification, Associative classification, Naïve Bayesian classification

## INTRODUCTION:

Geospatial data is information that identifies the geographic location and characteristics of natural or constructed features and boundaries on the earth, typically represented by points, lines, polygons, and complex geographic features. This includes original and interpreted geospatial data, such as those derived through remote sensing including, but not limited to, images and raster data sets, aerial photographs, and other forms of geospatial data or data sets in both digitized and non-digitized forms. Geospatial data is a record if it is created or received in the course of doing EPA business, and it provides evidence of the Agency's actions, programs, operations and other activities. For example:

- Maps used to examine development patterns and conduct scenarios for growth management and transportation planning
- Locations of active and abandoned uranium mines, used to determine radiation risk to nearby populations, wildlife and the environment
- A model used to correlate children's population with toxicity risk

The number of applications that use spatial or geographic data has been increasing over the last decades. Examples include route optimization, urban planning, fire or pollution monitoring, disaster management, robotics, computer vision, and, more recently, computational biology and mobile computing applications. Spatial or geo-referenced data are collected in spatial databases and Geographical Information Systems (GIS) at a rate that requires automated data analysis methods in order to extract implicit, previously unknown, and potentially useful information.

Advances in database and data acquisition technologies have resulted in huge amounts of spatial data, much of which cannot be readily explored using conventional data analysis techniques. The purpose of spatial data mining is to automate the discovery of interesting and useful patterns that are not explicitly represented in spatial datasets.

Spatial association rules are association rules about spatial data objects: either the antecedent or the consequent of the rule must contain some spatial predicate. Spatial association rules are implications of one set of data by another. The main focus of this paper is to optimize the rules generated by association rule mining (the Apriori method) using a hybrid evolutionary algorithm. The main motivation for using evolutionary algorithms in the discovery of high-level prediction rules is that they perform a global search and cope better with attribute interaction than the greedy rule induction algorithms often used in data mining. We first build strong association rules for the spatial objects using this algorithm, and then build a classifier using structural classification.

In structural classification we use Bayesian classification method. Classification of spatial data can be difficult with existing methods due to the large numbers and sizes of spatial data sets. The task becomes even more difficult when we consider continuous spatial data streams. Data streams require a classifier that can be built and rebuilt repeatedly in near real time.

In this paper we focus on Bayesian classification. A Bayesian classifier is a statistical classifier, which uses the Bayes theorem to predict class membership as a conditional probability that a given data sample falls into a particular class.

## LITERATURE SURVEY

The number of applications using spatial or geographic data has been increasing over the last decades. The presence of a spatial dimension in the data adds substantial complexity to data mining tasks. In the last few years, a number of associative classification algorithms have been proposed, e.g. CBA, CPAR and CMAR.


Various kinds of rules can be discovered from spatial databases. Knowledge discovery in spatial databases raises challenging problems for geospatial data mining. A promising solution comes from the field of inductive logic programming (ILP): it benefits from the available prior knowledge on the spatial domain, systematically explores the hierarchical structure of geographic data, and deals with numerical properties of spatial objects.

For large data collections, where the goal is to uncover structure in the data and there is no preset target concept, the discovery of relatively simple but frequently occurring patterns has shown good promise. However, the patterns discovered in this way are not relational, so they cannot properly express spatial relations.

Relational frequent patterns can be generated by WARMR, a powerful inductive logic programming algorithm that allows the use of variables and multiple relations in patterns and thus significantly extends the expressive power of the patterns that can be found. WARMR is flexible in discovering frequent patterns, but it is difficult to apply to spatial data analysis because it provides no support for extracting properties of reference or task-relevant objects from spatial databases.

CBA (Classification Based on Associations) was introduced for associative classification. It consists of two parts: a rule generator, based on the Apriori algorithm for finding association rules, and a classifier builder (called CBA-CB). The framework not only gives a new way to construct classifiers, but also helps to solve a number of problems that exist in current classification systems.

Classification rule mining discovers a small set of rules that forms an accurate classifier. Association rule mining finds all the rules in the database that satisfy some minimum support and minimum confidence constraints. For association rule mining, the target of discovery is not predetermined, while for classification rule mining there is one and only one predetermined target. The two are integrated by focusing on a special subset of association rules, called class association rules (CARs). Reported results show that a classifier built this way is more accurate than other classifiers.
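To make the idea of class association rules concrete, the following is a minimal sketch in Python. It is not CBA's Apriori-based rule generator: it simply enumerates every condition set up to a small length by brute force and keeps the rules that meet the support and confidence thresholds. The function name `mine_cars`, the thresholds and the toy records are all assumptions made for illustration.

```python
from itertools import combinations
from collections import Counter

def mine_cars(records, min_support=0.3, min_confidence=0.6, max_len=2):
    """Return class association rules as (condset, label, support, confidence).

    records: list of (items, label) pairs, where items is a set of
    attribute-value strings. Brute-force enumeration, for illustration only.
    """
    n = len(records)
    cond_counts = Counter()  # how often each condition set occurs
    rule_counts = Counter()  # how often each (condition set, class) pair occurs
    for items, label in records:
        for size in range(1, max_len + 1):
            for condset in combinations(sorted(items), size):
                cond_counts[condset] += 1
                rule_counts[(condset, label)] += 1

    rules = []
    for (condset, label), count in rule_counts.items():
        support = count / n
        confidence = count / cond_counts[condset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((condset, label, support, confidence))
    # Rank by confidence, then support (higher first), then condition set
    rules.sort(key=lambda r: (-r[3], -r[2], r[0]))
    return rules

# Hypothetical toy data: (set of attribute values, class label)
records = [
    ({"near_river", "high_cars"}, "low_mortality"),
    ({"near_river", "high_cars"}, "low_mortality"),
    ({"near_river", "low_cars"}, "high_mortality"),
    ({"far_river", "low_cars"}, "high_mortality"),
]
for condset, label, sup, conf in mine_cars(records):
    print(condset, "->", label, f"support={sup:.2f}", f"confidence={conf:.2f}")
```

Every rule has a class label as its consequent, which is what distinguishes class association rules from general association rules. A real implementation would prune candidates level by level instead of enumerating them all.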


Recently introduced multi-relational data mining overcomes the limitations imposed by a single table: it can extract knowledge from data stored in multiple relational tables. Multi-relational data mining methods are typically based on either a structural approach or a propositional approach. Structural approaches are more powerful than propositional approaches because information about how the data were originally structured is not lost. Propositional approaches, in contrast, reduce the search space to a minimal subset that includes features obtained as a transformation of the original multi-relational feature space.

Transforming the original multi-relational problem into a single-table format allows one to apply conventional propositional data mining methods directly. In general, it is assumed that there is a one-to-one correspondence between each tuple in the original target table and each tuple in the single table obtained after the propositionalisation process. To date, multi-relational classification through propositionalisation has been extensively investigated within the field of Inductive Logic Programming (ILP).

Nevertheless, propositionalisation is also supported by systems that work directly with relational databases. These systems generally construct a single central relation by simply summarizing and/or aggregating information found in other tables.

A multi-relational, propositionalisation-based classification framework can make use of discovered multi-level association rules. The discovered rules are subsequently used to create a relational table. Moreover, a feature reduction algorithm is integrated to remove redundant features and improve the efficiency of classification without affecting the accuracy of the classifier. Further associative classification algorithms are described in Table 1.1 according to their characteristic features.

| Name | Data Layout | Rule Discovery | Ranking | Pruning | Prediction Method | References |
|---|---|---|---|---|---|---|
| CBA | Horizontal | Apriori candidate generation | Support, confidence, rules generated first | Pessimistic error, database coverage | Maximum likelihood | Liu et al. (1998) |
| CMAR | Horizontal | FP-growth approach | Support, confidence, rules cardinality | Chi-square, database coverage, redundant rule | CMAR multiple label | Li et al. (2001) |
| CPAR | Horizontal | Foil greedy | Support, confidence, rules cardinality | Laplace expected error estimate | CPAR multiple label | Yin & Han (2003) |
| ADT | Horizontal | Upward closure property | Support, confidence, rules cardinality, items lexicographical | Pessimistic error, redundant rule | Maximum likelihood | Wang et al. (2000) |
| ARC-AC | Horizontal | Apriori candidate generation | Support, confidence | Redundant rule | Dominant factor multiple label | Zaïane & Antonie (2002) |
| CAAR | Horizontal | Multipass Apriori | Support, confidence, rules generated first | Database coverage similar method | Not known | Xu et al. (2004) |

Table 1.1

## Detailed Problem Definition

Traditional spatial analysis methods were developed in an era when data were relatively scarce and computational power was far more limited than it is today (Miller & Han, 2009). Faced with the massive data that are increasingly available and the complex analysis questions that they may answer, traditional analysis methods often have one or more of the following three limitations:

1. Most existing methods focus on a limited perspective (such as univariate spatial autocorrelation) or a specific type of relation model (e.g., linear regression).
2. Many traditional methods cannot process very large data volumes.
3. Newly emerged data types (such as trajectories of moving objects, geographic information embedded in web pages, and surveillance videos) and new application needs demand new approaches to analyze such data and discover embedded patterns and information.

## Solution Methodology

The main aim is to build an associative classifier that helps to reduce the problems above. We therefore propose the Apriori algorithm for association rule mining and a structural approach, i.e. multi-level naïve Bayesian classification, for classification.

Spatial association rule mining

Association rule mining was originally intended to discover regularities between items in large transaction databases (Agrawal, Imielinski, & Swami, 1993). Here it is based on the Apriori approach, as used in SPADA.

Let $S = \{i_1, i_2, \ldots, i_m\}$ be a set of items (i.e., items purchased in transactions, such as computer, milk, bike, etc.). Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq S$. Let $X$ be a set of items; a transaction $T$ is said to contain $X$ if and only if $X \subseteq T$.

An association rule has the form $X \Rightarrow Y$, where $X \subset S$, $Y \subset S$ and $X \cap Y = \emptyset$.

The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of all transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$.

$$\mathrm{Support}(X \Rightarrow Y) = \frac{\text{support count of } (X \cup Y)}{\text{total number of transactions in } D}$$

Confidence denotes the strength of the rule and support indicates its frequency. It is often desirable to pay attention to those rules that have reasonably large support (Agrawal et al., 1993).

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$
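The two measures can be computed directly from a list of transactions. The sketch below is only an illustration: the helper names `support` and `confidence` and the toy transactions are assumptions, and the code expects the antecedent to occur in at least one transaction.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Support(X u Y) / Support(X) for the rule X => Y.

    Assumes X occurs in at least one transaction (otherwise division by zero).
    """
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical transactions over the items computer, milk and bike
transactions = [
    {"computer", "milk"},
    {"computer", "bike"},
    {"computer", "milk", "bike"},
    {"milk"},
]
print(support(transactions, {"computer", "milk"}))       # 2 of 4 transactions
print(confidence(transactions, {"computer"}, {"milk"}))  # 2 of the 3 with computer
```

Here the rule computer ⇒ milk has support 0.5 and confidence 2/3: two of the four transactions contain both items, and two of the three transactions that contain computer also contain milk.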

A rule "$X \Rightarrow Y / S$" is strong if the predicate $X \wedge Y$ is large in the set $S$ and the confidence of "$X \Rightarrow Y / S$" is high.

Example:

$$\mathit{is\_a}(X, \mathit{town}) \wedge \mathit{intersects}(X, Y) \wedge \mathit{is\_a}(Y, \mathit{regional\_road}) \rightarrow \mathit{intersects}(X, Z) \wedge \mathit{is\_a}(Z, \mathit{main\_trunk\_road}) \wedge Z \neq Y \quad (65\%, 71\%)$$

which states that "if a town X intersects a regional road Y, then X intersects a main trunk road Z distinct from Y, with 65% support and 71% confidence."

## Structural Classification Approach: Multi-level Naïve Bayesian

After extracting a set of rules for each level through spatial rule mining, we use them to construct a naïve Bayesian classifier, which aims to classify any target object $o \in S$ by maximizing the posterior probability $P(C_i \mid o)$ that $o$ is of class $C_i$, i.e.

$$\mathrm{class}(o) = \arg\max_i P(C_i \mid o)$$

By Bayes' theorem, $P(C_i \mid o)$ can be reformulated as follows:

$$P(C_i \mid o) = \frac{P(C_i)\,P(o \mid C_i)}{P(o)}$$

The term $P(o \mid C_i)$ is estimated by means of the naïve Bayes assumption, i.e.

$$P(o \mid C_i) = P(o_1, o_2, \ldots, o_m \mid C_i) = P(o_1 \mid C_i) \times P(o_2 \mid C_i) \times \cdots \times P(o_m \mid C_i)$$

where $o_1, o_2, \ldots, o_m$ represent the set of properties, different from the class, used to describe the object. This assumption is violated when the predictor variables are statistically dependent.
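The decision rule above can be sketched in a few lines of Python. This is a generic naïve Bayes illustration, not the multi-relational classifier itself: the function name `classify`, the probability tables and the example object are hypothetical, and all probabilities are assumed to be strictly positive. Because $P(o)$ is the same for every class, it is dropped from the comparison.

```python
import math

def classify(obj, priors, likelihoods):
    """Return argmax_i P(C_i) * prod_j P(o_j | C_i), computed in log space.

    obj: {attribute: value}
    priors: {class: P(class)}
    likelihoods: {class: {(attribute, value): P(attribute=value | class)}}
    All probabilities are assumed to be > 0 (log(0) is undefined).
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for attr_value in obj.items():  # each item is an (attribute, value) pair
            score += math.log(likelihoods[c][attr_value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical probability tables for two mortality classes
priors = {"low": 0.5, "high": 0.5}
likelihoods = {
    "low": {("cars_per_person", "high"): 0.8, ("near_river", True): 0.6},
    "high": {("cars_per_person", "high"): 0.3, ("near_river", True): 0.5},
}
obj = {"cars_per_person": "high", "near_river": True}
print(classify(obj, priors, likelihoods))  # class with the larger product
```

With these numbers the unnormalized scores are 0.5 × 0.8 × 0.6 = 0.24 for the class low and 0.5 × 0.3 × 0.5 = 0.075 for the class high, so the object is assigned to low. Summing logarithms instead of multiplying probabilities avoids numerical underflow when many properties are involved.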

Given the object $o \in S$, we consider the subset of the extracted rules that can be used to classify $o$. More formally, we consider the subset $R$ of rules whose body is satisfied by the object to be classified, both in terms of the values of the properties of the spatial objects involved and in terms of the spatial relations between objects. For example, if $S$ is the set of wards in a district, a ward $w$ satisfies the rule:

$$\mathit{wards\_relatedTo\_waters}(A, B) \wedge \mathit{waters\_typewater}(B, \mathit{river}) \wedge \mathit{cars\_per\_person}(A, \mathit{high}) \rightarrow \mathit{mortality\_rate}(A, \mathit{low})$$

when $w$ is related to a water body of type river and has a high number of cars per person.

We use $R$ to estimate $P(o \mid C_i)$. In particular, we estimate $P(o \mid C_i)$ by means of the probabilities associated with both the spatial relations (e.g. `wards_relatedTo_waters(A, B)`) and the properties (e.g. `waters_typewater(B, river)`, `cars_per_person(A, high)`) of each rule in $R$.

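For instance, suppose $R = \{R_1, R_2\}$, where $R_1$ and $R_2$ are two association rules of class $C_i$ extracted by SPADA:

$$R_1: \beta_{1,0} \;\text{:-}\; \beta_{1,1}, \beta_{1,2}$$

$$R_2: \beta_{2,0} \;\text{:-}\; \beta_{2,1}, \beta_{2,2}$$

where $\beta_{1,1}$ and $\beta_{2,1}$ are spatial relations, $\beta_{1,2}$ and $\beta_{2,2}$ are properties, and $\beta_{1,0} = \beta_{2,0}$ (the class). Then

$$P(\{R_1, R_2\} \mid C_i) = P(\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap \beta_{1,2} \cap \beta_{2,2} \mid C_i) = P(\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \mid C_i) \cdot P(\beta_{1,2} \cap \beta_{2,2} \mid \beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap C_i)$$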

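The first term takes into account the relations of the rules, while the second term refers to the conditional probability of satisfying the property predicates in the rules given the relations. By means of the naïve Bayes assumption, the probabilities can be factorized as follows (the class literal $\beta_{1,0}$ contributes no factor once $C_i$ is given):

$$P(\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \mid C_i) = P(\beta_{1,1} \mid C_i) \cdot P(\beta_{2,1} \mid C_i)$$

$$P(\beta_{1,2} \cap \beta_{2,2} \mid \beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap C_i) = P(\beta_{1,2} \mid \beta_{1,1} \cap \beta_{2,1} \cap C_i) \cdot P(\beta_{2,2} \mid \beta_{1,1} \cap \beta_{2,1} \cap C_i)$$

Since $\beta_{1,2}$ and $\beta_{2,2}$ do not depend on $\beta_{2,1}$ and $\beta_{1,1}$ respectively, the second expression simplifies to

$$P(\beta_{1,2} \mid \beta_{1,1} \cap C_i) \cdot P(\beta_{2,2} \mid \beta_{2,1} \cap C_i)$$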

By generalizing to a set of rules we have:

$$P(C_i)\,P(o \mid C_i) = P(C_i) \prod_{k=1}^{|R|} \Big( P(\mathit{relations}_k \mid C_i) \prod_{j} P(\mathit{property}_{k,j} \mid \mathit{relations}_k, C_i) \Big)$$

where the term $\mathit{relations}_k$ represents the event that the set of spatial relations expressed in the $k$-th rule is satisfied, while the term $\mathit{property}_{k,j}$ represents the event that the $j$-th property of the $k$-th rule is satisfied.
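As a rough numerical illustration of the generalized product, the sketch below multiplies a class prior by the relation and property probabilities of each rule. The function name `rule_based_score` and all probability values are hypothetical, and the probabilities are assumed to have been estimated beforehand and to be strictly positive.

```python
import math

def rule_based_score(prior, rules):
    """P(C_i) * prod_k [ P(relations_k | C_i) * prod_j P(property_kj | relations_k, C_i) ].

    rules: list of (p_relations, [p_property, ...]) pairs, one pair per rule
    in R whose body is satisfied by the object; all probabilities must be > 0.
    """
    log_score = math.log(prior)
    for p_relations, p_properties in rules:
        log_score += math.log(p_relations)
        log_score += sum(math.log(p) for p in p_properties)
    return math.exp(log_score)

# Two hypothetical rules: the first has one property, the second has two
score = rule_based_score(0.4, [(0.5, [0.8]), (0.25, [0.5, 0.5])])
print(score)  # 0.4 * (0.5 * 0.8) * (0.25 * 0.5 * 0.5)
```

Computing this score for every class and choosing the class with the largest value gives the predicted class of the object.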

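If $\mathit{relations}_k = \{\, \mathit{relation}(\mathit{Set}_1, \mathit{Set}_2) \mid \mathit{Set}_1, \mathit{Set}_2 \in \{S\} \cup \{R_h, 1 \le h \le m\},\ \mathit{Set}_1 \ne \mathit{Set}_2 \,\}$ is the set of binary relations between spatial objects (either task-relevant or reference) involved in the $k$-th rule, the probability $P(\mathit{relations}_k \mid C_i)$ is computed by means of the naïve Bayes assumption:

$$P(\mathit{relations}_k \mid C_i) = \prod_{l=1}^{|\mathit{relations}_k|} P(\mathit{relation}(\mathit{Set}_{l1}, \mathit{Set}_{l2}) \mid C_i)$$

where:

$$P(\mathit{relation}(\mathit{Set}_{l1}, \mathit{Set}_{l2}) \mid C_i) = P(\mathit{relation}(\mathit{Set}'_{l1}, \mathit{Set}'_{l2})) = \frac{|\mathit{relation}(\mathit{Set}'_{l1}, \mathit{Set}'_{l2})|}{|\mathit{Set}'_{l1}| \cdot |\mathit{Set}'_{l2}|}$$

Here $\mathit{Set}'_l$ is the subset of objects in $\mathit{Set}_l$ that are related, by means of spatial relations, to objects in $S$ of class $C_i$, while $|\mathit{relation}(\mathit{Set}'_{l1}, \mathit{Set}'_{l2})|$ is the number of relations between objects of $\mathit{Set}'_{l1}$ and objects of $\mathit{Set}'_{l2}$.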
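The ratio used for a single relation can be illustrated as follows. This is a sketch under the assumption that the two restricted object sets have already been determined; the function name `relation_probability` and the ward and river identifiers are hypothetical.

```python
def relation_probability(pairs, set1, set2):
    """Number of related (a, b) pairs divided by |set1| * |set2|.

    pairs: iterable of (a, b) tuples for which the spatial relation holds
    set1, set2: the restricted object sets, assumed to be non-empty
    """
    related = {(a, b) for a, b in pairs if a in set1 and b in set2}
    return len(related) / (len(set1) * len(set2))

# Hypothetical wards and rivers, with three ward-river relations
wards = {"w1", "w2"}
rivers = {"r1", "r2", "r3"}
pairs = [("w1", "r1"), ("w2", "r1"), ("w2", "r3")]
print(relation_probability(pairs, wards, rivers))  # 3 related pairs out of 6 possible
```

The denominator counts every possible pairing of objects from the two sets, so the estimate is the fraction of possible pairs that are actually related.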

To compute the probability $P(\mathit{property}_{k,j} \mid \mathit{relations}_k, C_i)$ we use the Laplace estimate:

$$P(\mathit{property}_{k,j} \mid \mathit{relations}_k, C_i) = \frac{|\mathit{relations}_k \wedge \mathit{property}_{k,j} \wedge C_i| + 1}{|\mathit{relations}_k \wedge C_i| + F}$$

where $F$ is the number of admissible values of the property. The Laplace estimate is used to avoid null probabilities in the product. In practice, the numerator is the number of target objects of class $C_i$ that are related to other spatial objects by means of the spatial relations expressed in $\mathit{relations}_k$ and for which $\mathit{property}_{k,j}$ is satisfied, plus one. The denominator is the number of target objects of class $C_i$ that are related to other spatial objects by means of the spatial relations expressed in $\mathit{relations}_k$, plus $F$.
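The effect of the Laplace correction is easy to see numerically. In the sketch below, the function name `laplace_estimate` and the counts are hypothetical.

```python
def laplace_estimate(n_matching, n_total, n_values):
    """(count + 1) / (total + F), where F is the number of admissible values.

    n_matching: target objects of the class that satisfy the relations
                and the property
    n_total:    target objects of the class that satisfy the relations
    n_values:   F, the number of admissible values of the property
    """
    return (n_matching + 1) / (n_total + n_values)

# A property with 3 admissible values, observed on 10 related objects
print(laplace_estimate(0, 10, 3))  # 1/13: non-zero even with no matching object
print(laplace_estimate(7, 10, 3))  # 8/13
```

Without the correction, a property value never observed for a class would yield a probability of zero and cancel the whole product, regardless of the other factors.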

To avoid considering the same relation or the same property more than once in the computation of the probabilities, each value is computed and included in the product only if it has not been computed before.

## Conclusions

In this paper we have presented a spatial associative classifier that combines spatial association rule discovery with naïve Bayes classification. The methods developed here explore efficient mining of spatial association rules at multiple approximation and abstraction levels. The approach first performs less costly, approximate spatial computation to obtain approximate spatial relationships at a high abstraction level, and then refines the spatial computation only for those data or predicates whose refined computation, according to the approximate computation, may contribute to the discovery of strong association rules.

Finally, for each granularity level, the extracted rules concur in building the spatial classification model by exploiting a multi-relational naïve Bayesian classifier. The method investigated in this study is currently under implementation and experimentation.

We plan to integrate this technique with the generalization-based spatial data mining technique developed previously, and will report the prototype implementation and experiments with reasonably large spatial databases in the future.