# Study On Problems In Geographical Information System Computer Science Essay

The purpose of this paper is to present the k-means++ clustering algorithm. K-means++ is an algorithm for choosing the initial values for the k-means clustering algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii as an approximation algorithm for the NP-hard k-means problem, a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm. The k-means problem is to find cluster centers that minimize the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it). Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is widely used and frequently finds reasonable solutions quickly.
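As a rough illustration of the objective and of Lloyd's iterations described above, the following is a minimal pure-Python sketch. The function names (`squared_distance`, `kmeans_cost`, `lloyd`) and the choice to leave an empty cluster's center unchanged are assumptions made for this example, not part of the original work.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def lloyd(points, centers, max_iter=100):
    """Run Lloyd's iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: group each point with its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append(tuple(
                    sum(p[d] for p in cluster) / len(cluster)
                    for d in range(dim)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(center)
        if new_centers == centers:  # no center moved, so we have converged
            break
        centers = new_centers
    return centers
```

The quality of the result depends on the initial `centers` passed in, which is the weakness that a better initialization is meant to address.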

However, the k-means algorithm has at least two major theoretical shortcomings:

First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.

Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.

The k-means++ algorithm addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is, in expectation, O(log k)-competitive with the optimal k-means solution.
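For reference, the guarantee can be written out as follows. Here $\phi$ denotes the k-means objective of the clustering obtained with k-means++ seeding over data set $X$ with centers $C$, and $\phi_{\mathrm{OPT}}$ the objective of an optimal clustering. The symbols are notation chosen for this note, the constant is the one given by Arthur and Vassilvitskii, and the expectation is taken over the random choices made during seeding.

```latex
\mathbb{E}[\phi] \;\le\; 8(\ln k + 2)\,\phi_{\mathrm{OPT}},
\qquad
\phi = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert^{2}
```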

## PROPOSED WORK

In this paper we propose KVISIMINE++, a combination of k-means++ clustering and Visimine.

## K-means++ clustering.

With the intuition of spreading the k initial cluster centers away from each other, the first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the closest cluster center already chosen.

The exact algorithm is as follows:

1. Choose one center uniformly at random from among the data points.

2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.

3. Add one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.

4. Repeat Steps 2 and 3 until k centers have been chosen.

5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
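Steps 1 to 4 above can be sketched in pure Python as follows. This is a minimal illustration rather than the implementation used in this work: the name `kmeanspp_init`, the optional `rng` parameter, and the uniform fallback when every remaining distance is zero are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(points, k, rng=None):
    """Choose k initial centers from `points` using D(x)^2 weighting."""
    rng = rng or random.Random()
    # Step 1: choose the first center uniformly at random.
    centers = [rng.choice(points)]
    # Step 2: D(x)^2 for every point with respect to the chosen centers.
    d2 = [squared_distance(p, centers[0]) for p in points]
    # Step 4: repeat until k centers have been chosen.
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point coincides with a chosen center,
            # fall back to a uniform choice so the loop still terminates.
            new_center = rng.choice(points)
        else:
            # Step 3: sample a point with probability proportional to D(x)^2.
            new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        # Step 2 again: only the newest center can shrink a point's distance.
        d2 = [min(old, squared_distance(p, new_center))
              for p, old in zip(points, d2)]
    return centers
```

The returned centers would then be handed to a standard k-means routine, as in step 5. Points that are already centers have weight zero, so they are not selected again while other points remain.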

This method yields considerable improvements in the final error of k-means. Although the initial selection takes extra time, the k-means part itself converges very quickly after this seeding, so the overall computation time is usually lower as well. The authors tested the method with real and synthetic datasets and typically obtained 2-fold improvements in speed and, for certain datasets, close to 1000-fold improvements in error. Additionally, the authors derive an approximation ratio for their algorithm: k-means++ guarantees an approximation ratio of O(log k) in expectation, where k is the number of clusters used. This is in contrast to standard k-means, which can generate clusterings arbitrarily worse than the optimum.

## Database Organization

Visimine is a system for data mining and statistical analysis of large collections of remotely sensed images. The Visimine system provides the infrastructure and methodology required for the analysis of satellite images. The query language allows the user to specify the type of knowledge to be discovered, the set of data relevant to the mining process, and the conditions that have to be satisfied by the data. Based on this query, an SQL statement is constructed to retrieve the relevant data.

This new domain requires expertise in image processing, database organization, pattern recognition, content-based retrieval and data mining. Image processing concerns the understanding and extraction of patterns from a single image; this system additionally gives users the capability to deal with large collections of images by accessing large image databases, and to extract and infer knowledge about patterns hidden in the images, so that the set of relevant images is dynamic, subjective and unknown. It enables communication between heterogeneous sources of information and users with diverse interests at a high level of semantic abstraction. The graphical user interface enables browsing and manipulation of the images and associated features, creation of data mining queries, and visualization of the results of the data analysis.

Fig 2. Visimine System

Visimine uses an SQL-like query language that enables specification of the data mining task, the features to be used in the mining process, and any additional constraints. The system is capable of performing similarity searches based on any combination of features. In Visimine we use a method for training land cover labels that employs naïve Bayesian classifiers. Visimine is based on decision tree models.

MatLab is a state-of-the-art mathematical software package that is used extensively in both academia and industry. It is an interactive program for numerical computation and data visualization which, along with its programming capabilities, provides a very useful tool for almost all areas of science and engineering. Visimine data can be accessed from within MatLab by using Java connectivity for images and ODBC connectivity for image and region data. Visimine can also display graphics, which are created using a command line interface and shown within a MatLab figure window. The combination of MatLab and Visimine features creates a unique environment for interactive exploration and analysis of remotely sensed data.
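To make the idea of a similarity search over "any combination of features" more concrete, here is a small sketch that ranks records by Euclidean distance computed only over the features the caller selects. It does not reproduce Visimine's actual search: the record layout, the feature names `red` and `texture`, and the function `similarity_search` are all hypothetical.

```python
def similarity_search(records, query, feature_names, top_n=3):
    """Rank records by Euclidean distance to `query` over chosen features.

    `records` is a list of dicts and `query` a dict; only the keys listed
    in `feature_names` take part in the distance computation.
    """
    def distance(record):
        return sum((record[f] - query[f]) ** 2 for f in feature_names) ** 0.5

    return sorted(records, key=distance)[:top_n]


# Hypothetical feature table for three image tiles.
TILES = [
    {"id": "a", "red": 0.10, "texture": 0.90},
    {"id": "b", "red": 0.80, "texture": 0.20},
    {"id": "c", "red": 0.15, "texture": 0.10},
]
```

Calling `similarity_search(TILES, {"red": 0.10, "texture": 0.15}, ["red"])` ranks by color alone, while passing `["red", "texture"]` changes the ranking because texture now contributes to the distance.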

## PRELIMINARIES

Problem definition: Modern query languages such as SQL are not sufficient in either performance or sophistication for much of the major development required in a GIS system - but then one would argue that they were not intended for this. One can see why people like SQL; it can give immense power in return for some fairly simple "select" constructs. A problem which has to be addressed is spatial queries within the language, since trying to achieve this with the standard set of predicates provided is extremely difficult and clumsy. If the route adopted is to provide two databases in parallel, a commercial one driven by SQL and a geometry database to hold the graphics, then there is a problem constructing queries that address both databases. Ideally, the query language should be a natural subset of the front end language allowing access to the same seamless environment that the front end language provides. Much work needs to be done in the area of query languages for GIS.
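The difficulty can be illustrated with a small sketch using Python's built-in `sqlite3` module and only standard comparison and arithmetic predicates. The `sites` table, its columns, and the helper functions are hypothetical and exist only for this example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical table of point sites."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sites (name TEXT, x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO sites VALUES (?, ?, ?)",
        [("well", 1.0, 1.0), ("bridge", 4.0, 5.0), ("tower", 9.0, 9.0)],
    )
    return conn


def sites_in_box(conn, xmin, ymin, xmax, ymax):
    """Axis-aligned rectangle query: expressible with plain comparisons."""
    rows = conn.execute(
        "SELECT name FROM sites "
        "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ? ORDER BY name",
        (xmin, xmax, ymin, ymax),
    )
    return [row[0] for row in rows]


def sites_within_radius(conn, cx, cy, radius):
    """Circle query: the distance formula has to be spelled out by hand."""
    rows = conn.execute(
        "SELECT name FROM sites "
        "WHERE (x - ?) * (x - ?) + (y - ?) * (y - ?) <= ? * ? ORDER BY name",
        (cx, cx, cy, cy, radius, radius),
    )
    return [row[0] for row in rows]
```

A rectangle is manageable and a circle is already awkward. A query such as "all sites inside this irregular parcel boundary" has no natural expression with these predicates alone, which is the gap the paragraph above describes.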

## EXPERIMENTAL RESULTS

KVISIMINE++ can also display MatLab graphics, which are created using a command line interface and shown within a figure window.

Results for different variations of the k-means algorithm using a tree image classified according to color.

(Total number of records present in dataset = 70)

| Clustering Algorithm | Correctly Classified | Average Accuracy |
|---|---|---|
| k-means | 68 | 94.88 |
| k-means++ | 70 | 95.83 |

The combination of MatLab and Visimine features creates a unique environment for interactive exploration and analysis of remotely sensed data and large images, and allows for easy and powerful customization of the data analysis process. Visimine provides the infrastructure and methodology required for the analysis of satellite, land, water and other images. In order to facilitate the analysis of large amounts of image data, we extract features of the images. Large images are partitioned into a number of smaller, more manageable image tiles. Partitioning allows fetching of just the relevant tiles when retrieval of only part of the image is requested, and provides faster segmentation of image tiles. Individual image tiles are processed to extract the feature vectors.
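The tiling and per-tile feature extraction described above can be sketched as follows. The image is represented as a plain list of rows of intensities, and the feature vector (mean and variance) is a toy stand-in, since the text does not say which features are actually extracted. The function names are likewise assumptions.

```python
def split_into_tiles(image, tile_h, tile_w):
    """Partition a 2-D image (list of rows) into tiles.

    Returns a dict keyed by (tile_row, tile_col). Tiles on the right and
    bottom edges are smaller when the image size is not a multiple of the
    tile size.
    """
    tiles = {}
    for top in range(0, len(image), tile_h):
        for left in range(0, len(image[0]), tile_w):
            tile = [row[left:left + tile_w]
                    for row in image[top:top + tile_h]]
            tiles[(top // tile_h, left // tile_w)] = tile
    return tiles


def tile_features(tile):
    """Toy feature vector for one tile: (mean, variance) of its pixels."""
    pixels = [value for row in tile for value in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((v - mean) ** 2 for v in pixels) / len(pixels)
    return (mean, variance)
```

Because tiles are keyed by their grid position, a request for part of the image only needs to look up the keys that overlap the requested region, for example `tiles[(1, 1)]`, instead of loading the full image.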

## PERFORMANCE EVALUATIONS

MatLab connectivity: MatLab is an interactive computing environment for graphics, data analysis, statistics, and mathematical computing. It provides an environment for high-interaction graphical analysis of multivariate data, modern statistical methods, data clustering and classification, and mathematical computing. In total, MatLab contains over 3000 functions for scientific data analysis.

Data was transferred from MatLab to MS Access using the database connectivity (ODBC) tools provided by the MatLab Database Toolbox. The data was then transferred from MatLab across the LAN to SQL Server 7. This process was repeated for matrices with 4 columns per row and then 253 columns per row; each matrix contained 1000 rows. Once the MatLab process was complete, MatLab was closed and MS Access opened. A process was then run that gathered the timestamp information for each row written to the MS Access tables and the SQL Server 7 tables. The SQL Server 7 tables were then emptied, and the row data in MS Access was written to SQL Server 7.

Visimine data can be accessed from within MatLab by using Java connectivity for images and ODBC connectivity for image and region data. In addition, Visimine has the MatLab command tool, which provides for easy transfer of images and for data processing. KVISIMINE++ displays MatLab graphics, which are created using a command line interface and shown within a figure window. The combination of MatLab and KVISIMINE++ features creates a unique environment for interactive exploration and analysis of remotely sensed images and data. The rich statistical functionality of MatLab, together with the user interface and the scalability of its data mining engine, allows for easy and powerful customization of the data analysis process.

Figure 3.a Graph for k-means

Figure 3.b Graph for k-means++

According to the graphs, k-means++ yields better-quality clusters, so these concepts can be used in practice. The initial selection of cluster centers plays a very important role in k-means.

## CONCLUSIONS

In this paper we presented KVISIMINE++, a MatLab-based framework that provides a powerful numeric engine and technical programming environment with interactive exploration and visualization tools. MatLab has become the language of technical computing, and the framework explores state-of-the-art data mining and database technologies to retrieve integrated spectral and spatial information from geographical information system imagery. A scalable data warehouse containing a huge amount of images may be a better database architecture for a fundamentally distributed data management and mining system such as the NASA Earth Observing System (EOS). Meanwhile, performance analysis for clustering on and retrieving from large volumes of images is critical for the system to succeed in practical applications.

The results of experiments on images show that the proposed approach can greatly improve the efficiency and performance of image retrieval, as well as the convergence to the user's retrieval concept. Clustering algorithms have been widely used in computer vision, for example in image segmentation, and Visimine is able to distinguish between pixel, region and tile levels of features, providing several feature extraction algorithms for each level. In addition, the current implementation provides data-based and image-based search. A segmentation process can be used to segment an image into non-overlapping regions, to which texture feature extraction can then be applied.