# Improving Clustering Efficiency Of Dbscan Accounting Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract - Data mining, also known as knowledge discovery in databases, is a statistical analysis technique used for extracting previously undiscovered patterns and recognizing untapped value in large datasets. Clustering is a principal data discovery technique in data mining that segregate a dataset into subsets or clusters so that data values in the same cluster are similar in some meaningful way. In the past, a number of clustering techniques have been proposed by the researchers that can identify arbitrary shaped cluster; where a cluster is defined as dense region separated by the low-density regions and among them DBSCAN is a prime density-based clustering algorithm. DBSCAN is capable of discovering clusters of any arbitrary shape and size in databases - which even include noise and outliers. Many researchers have attempted to overcome certain deficiencies in the original DBSCAN algorithm resulting in its augmented forms like VDBSCAN, FDBSCAN, DD_DBSCAN and IDBSCAN. As part of the preliminary study for this research, the variations in different forms of DBSCAN algorithm were critically evaluated to highlight their merits and demerits. In this study, we propose an incremental density-based clustering technique which is based on the fundamental DBSCAN clustering algorithm. We have endeavored to improve time complexity of DBSCAN algorithm and overcome its key problem of dependency on the user to supply threshold values. The proposed technique has been evaluated using two-dimensional dataset which has exhibited faster cluster identification and improved efficiency over the original DBSCAN algorithm.

## Keywords:

## Clustering, DBSCAN, Data Mining, Statistical Analysis, Knowledge Discovery in Databases.

INTRODUCTION

Data mining is the process of extracting hidden and interesting distinctive patterns and affinities from large datasets and using it in decision. The

extracted patterns, rules and relationships serve as a valuable tool in the process of decision-making and future prediction. To make use of the extracted information the availability of efficient and effective analysis methods are imperative. One of such method is clustering in which a dataset of objects is divided into several clusters where the intra-cluster similarity is maximized and the inter-cluster similarity is minimized [15].

In the past, large number of clustering algorithms has been proposed by the academics and researchers for exploring attribute similarities in large datasets. The clustering techniques are categorized into partitioning, hierarchal, grid-based, density-based and model-based. Under the partitioning category, the foremost techniques include PAM [2], CLARA [2] and CLARANS [3]. The well known algorithms of hierarchical category include CHEMELEON [5]. The grid-based clustering techniques include CLIQUE [6], ENCLUS [7] and WaveCluster [8]. Among the density-based clustering techniques, DBSCAN [9], DENCLUE [10] and OPTICS [11] are generally popular. Paradigms of the model-based clustering techniques are COBWEB [12] and SOM [13]. Algorithms proposed under each of these clustering categories strive to discover proximities of the data objects on the basis of certain characteristics of one or more attributes. For this purpose, these algorithms scan large amount of data stored in database or data files. However, each of these clustering algorithms has certain limitations and deficiencies. A clustering technique is considered to be useful if it satisfies the following requirements [1]:

Minimal requirements of the domain knowledge to determine the values of its input parameters, which is very important challenge especially for large datasets.

Discovery of arbitrary shaped clusters.

Good computational efficiency on large datasets.

The density-based clustering algorithms are useful to discover clusters from the datasets with arbitrary shape. These algorithms typically cluster as dense regions of points in the data space that are separated by regions of low density. DBSCAN [9] is the first and leading density-based clustering technique. DBSCAN forms clusters with respect to a density based connectivity analysis.

In this paper, we propose a new clustering algorithm to improve clustering efficiency of DBSCAN. Our proposed algorithm is named as Incremental DBSCAN (IncDBSCAN) and is based on the original DBSCAN algorithm. We endeavor to improve time complexity of DBSCAN algorithm and overcome its key problem of dependency on the user to supply threshold values. For a given sorted dataset of objects, our algorithm first calculates the density thresholds and then explores the similarity of dataset objects to form clusters. The proposed technique has been evaluated using two-dimensional dataset which has exhibited faster cluster identification and improved efficiency.

This paper consists of seven sections. This section of the paper describes introduction to clustering techniques. A simplified description of DBSCAN algorithm is provided in Section II. Section III highlights different DBSCAN variations as an augmentation to the original DBSCAN algorithm proposed by several researchers. The model for our proposed Incremental DBSCAN method is described in Section IV. Section V highlights the Computational complexity of our proposed algorithm. Experimental details and conclusion along with future directions are discussed in Section VI and VII respectively.

DBSCAN

DBSCAN is the foremost and primary density based clustering algorithm. It was proposed by Ester et al. [9] in 1996 with the key objective to cluster data points of arbitrary shapes in the presence of noise in spatial and non-spatial high dimensional databases. The key idea of DBSCAN is based on the concept that for each object of a cluster, the neighborhood of a given radius (named as Eps) should contain at least a minimum number of objects (MinPts); this means that cardinality of the neighborhood needs to satisfy or exceed certain threshold. However, the threshold is not a fixed value rather is purely defined by the user. The most drawn in concepts in DBSCAN are: -neighborhood and MinPts. The-neighborhood of an arbitrary data point 'p' is defined as:

â€¦â€¦â€¦... (I)

Where, D is the database of objects.

If the-neighborhood of a data point P contains at least a minimal number of points then that point is called core point. So, the core point is defined as:

|â€¦â€¦â€¦â€¦â€¦â€¦â€¦â€¦...â€¦.... (II)

In equation II, Eps and MinPts denote radius of the neighborhood and minimum number of points in the -neighborhood of a core point respectively and are user specified parameters. If this condition is not satisfied, then the point is considered as non-core point.

DBSCAN searches for the clusters by checking the -neighborhood of each data object in the dataset. If the -neighborhood of an object p contains more data objects than the MinPts threshold, and then a new cluster with p as its core object is formed. The algorithm then iteratively collects directly density-reachable objects from these core objects, which may possibly entail merging the new density-reachable data objects into the previously created cluster. This process terminates when no new data object can be added to any cluster [6].

The pseudo code of DBSCAN algorithm is shown in Figure.1.

## Algorithm: DBSCAN (D, Eps, MinPts)

// All objects in D are unclassified.

## Begin

FORALL objects o in D DO:

If o is unclassified

Call function expand_ cluster to construct a cluster wrt. Eps and MinPts containing o.

## End

## FUNCTION expand_ cluster (o, D, Eps, MinPts)

Begin

Retrieve the Eps-neighbourhood (o) of o;

IF | NEps (o) | < MinPts //i.e. o is not a core object

Mark o as noise point and RETURN;

ELSE // i.e. o is a core object

Select a new cluster- id and mark all objects in NEps (o) with

This current cluster-id

Push all objects fromNEps (o)| (o) onto the

Stack seeds;

WHILE NOT seeds.empty () DO

CurrentObject: = seeds.top ();

Retrieve the Eps-neighbourhood

NEps(CurrentObject) of CurrentObject;

IF| NEps (CurrentObject) | â‰¥ MinPts.

Select all objects in NEps (currentObject) not yet classified or are marked as noise,

Push the unclassified objects onto seeds and mark all of these objects with current

Current-id;

Seeds. Pop ();

RETURN

## End

Figure1: Pseudo code of original DBSCAN algorithm [4].

RELATED WORK

In this section, different variations of DBSCAN algorithm that evolved in the recent past are enunciated. This work is based on our previous literature survey and critical analysis of DBSCAN algorithm [4].

VDBSCAN [14] is the modified version of DBSCAN algorithm suggested by Liu et al. In VDBSCAN, the authors chalked out a strategy to make the existing density-based algorithm more effective and scalable by extending the original DBSCAN algorithm to analyze the datasets having varied densities. The density threshold values are calculated for different dataset densities according to a K-dist plotting, which are subsequently used to analyze the clusters with different densities. After that, the DBSCAN algorithm is applied with the calculated values of the Eps parameter. The major contribution of the authors is that they extended DBSCAN algorithm to analyze the dataset with varied densities and provide a method to calculate the Eps values automatically.

In 2004, El-Sonbaty et al. [15] provided an enhancement version of DBSCAN to generate efficient clustering results from larger sized datasets. In the preprocessing step, the dataset being analyzed is partitioned using CLARANS [3] technique. By virtue of the partitioning of dataset, searching efforts to locate the core object is minimized. The major achievements of their study include: (i) it takes a lesser amount of time to cluster the dataset by partitioning dataset and limiting the search space to only of the partitioned data objects rather than the whole dataset; (ii) as the dataset is partitioned into smaller object set, therefore, a smaller buffer size is required for holding the partitioned dataset. However, the authors did not provide any mechanism to decide the number of partitions for a dataset required by CLARANS [3]. Similar to the original DBSCAN algorithm, this technique also requires the user to provide values for Eps and MinPts parameters.

FDBSCAN [16] algorithm was introduced by Bing Liu to overcome some of the DBSCAN limitations including: (i) its slow speed (deceleration in neighborhood query due to comparisons involved for each object); (ii) and setting the threshold value. From the experimental details, author concluded that FDBSCAN can accurately cluster the datasets and it can achieve its result in lesser time as compared to DBSCAN. It is a time efficient algorithm as it decreases the computational time by ignoring the region objects which are already clustered. Before initiating the clustering mechanism, the FDBSCAN applies the kernel function on the labeled objects which results in more accurate clusters.

GRIDBSCAN [17] proposed by Uncu et al. is another important variation of DBSCAN. One of the major issues associated with most of the density-based clustering algorithms is that they are not effective to perform clustering accurately in the presence of clusters having different and uneven densities. The authors proposed a three-level clustering mechanism to provide a solution to this problem. Although GRIDBSCAN provide accurate result for clusters with different densities, but it is computationally expensive when applied on large spatial datasets.

IDBSCAN [18] is an improved version of the DBSCAN outlined by Borah and Bhattacharyya. This algorithm proposes to introduce a sampling technique into DBSCAN to make it effective while dealing with large volumes of spatial data objects. The performance evaluation of IDBSCAN shows that DBSCAN is expensive as it requires huge chunk of memory to manipulate and operate on the entire dataset objects. I/O processing cost is also quite expensive in the later technique. . As highlighted by the authors through their experimental results, the major achievement of IDBSCAN appeared to be the optimization of time required to cluster the large spatial dataset and reduction of the I/O processing cost as compared to the DBSCAN algorithm.

An enhancement of DBSCAN algorithm is provided by Ram et al. who named their algorithm as EDBSCAN [19]. Their investigation discovers the key problem in DBSCAN of not handling the local density variation within the cluster and proposes that a significant density variation needs to be allowed within the cluster to enhance its performance. EDBSCAN uses two user specified parameters and, in addition to the basic DBSCAN parameters. These additional parameters are used to specify the cutoff points to limit the amount of allowed local density variation within the cluster. The most functional contribution introduced by EDBSCAN over the DBSCAN is that it explores the local density variation within cluster that ultimately produces better results than the DBSCAN algorithm.

DD_DBSCAN [20] is another variation of the DBSCAN. The objective of this modification and enhancement is that it finds out those clusters in the dataset that differ in densities. In addition, it can also find the clusters having different shapes, sizes and differ in local density. This algorithm introduces an additional parameter 'Î±' to define the upper limit during the expansion of clusters. One of the limitations of this work is that this technique cannot handle the density variation within the cluster.

In 2008, Xiaoyun et al. [21] pointed out two problems of DBSCAN which state that: (i) it cannot choose density threshold (MinPts and) according to the distribution of data space; (ii) when it is used with large datasets then computational cost of DBSCAN is too high to cluster the dataset . To deal with these two issues, the authors proposed GMDBSCAN algorithm that introduces the concept of local MinPts. The algorithm proposed by the authors clusters the dataset in different regions by co-responding local MinPts and using grid-density as the approximation of the local MinPts. However, the clustering is made in each grid using the original DBSCAN. To cater for the noise and border point, GMDBSCAN sets a parameter according to the size of dataset. Hence, a cluster is classified as noise if the amount of data in that cluster is less than the value of this parameter. The key contribution of GMDBSCAN is that it provides a way to calculate the value of MinPts and Eps automatically to discover all clusters with arbitrary shape and separate the noise.

ST-DBSCAN [23] is another enhancement to DBSCAN. It attempts to enhance the original DBSCAN algorithm in the three following ways: (i) ST-DBSCAN clusters the spatial-temporal data according to its non-spatial, spatial and temporal attributes; (ii) ST-DBSCAN introduces a density-function for each cluster in the given dataset. This feature resolves the problem of DBSCAN of not detecting some of the noise points when dataset contains clusters of different densities; (iii) ST-DBSCAN also addresses the problem of identifying the adjacent clusters. Form the experimental results reported by the authors, ST-DBSCAN seems to be more efficient and effective when the dataset has spatial-temporal data characteristics [4].

KNNDBSCAN [22] is another practical improvement to produce better clustering results. The main problem associated with most of the density-based clustering algorithms is the determination of the global values for the density thresholds. KNNDBSCAN is based on a single user input parameter values denoted as 'K'. This parameter enables to determine the density thresholds (, MinPts) in unsupervised mode [22]. KNNDBSCAN merges two approaches to discover the arbitrary shaped clusters from the density-based datasets. These two approaches are the K-nearest neighbors and DBSCAN. The major achievements of this algorithm include: (i) automating computation of density threshold; (ii) it speeds up the clustering process as it can clusters the datasets in parallel fashion; (iii) at one time, only the partition of a dataset is taken for clustering, so it involve lesser memory consumption as the whole dataset is not required to be stored in memory during the execution of clustering algorithm [4].

The Density Clustering Based on Radius of Data (DCBRD) [1] is also the density-based clustering technique. DCBRD addresses the issue of user dependency to supply the values of density threshed parameter Eps and MinPts. The most beneficial contribution offered by DCBRD towards the density based clustering methodologies is that this technique can cluster the datasets without requiring any user supplied input parameter.

PROPOSED IncDBSCAN ALGORITHM

In this section, we present a new enhancement to the fundamental clustering algorithm DBSCAN and highlight the different processes evolved in enhancing DBSCAN. Further we evaluate the performance of IncDBSCAN using a two-dimensional dataset used in [19].

Our IncDBSCAN is based on DBSCAN [19]. IncDBSCAN extends DBSCAN in three different ways. In the first step, the sorting on dataset is performed. In the next step, the region query on the dataset is carried out in order to find those points that will be included into clusters. In the final step merging of the clusters resulted from the region query is performed. This merging produces the final clustering results. In this algorithm we have used a special variable called skipping the seed (seedSkipCount). It is used to skip those points which have not enough neighbors to form a cluster. A detailed description of these steps is given below.

## Dataset Sorting

In the proposed IncDBSCAN, the dataset is arranged in some order, either ascending or descending, and all the data points are initially flagged as noise. During the execution of the algorithm, flag of those data points that constitute a cluster is changed. Hence, at the end of the algorithm execution, only those points that have not been used in forming clusters are automatically singled out as noise.

## Region Query

Once dataset is organized in some fashion, we perform the region query starting from the point of origin i.e., both x and y coordinates have zero index. In this process, for a given Eps value, our algorithm checks the neighborhood of the data point currently being analyzed in horizontal, vertical and diagonal directions.

Increment & Decrement Coordinates

New Points

Increment X and Y by 1

(X+1,Y+1)

Increment X by 1 and decrement Y by 1

(X+1,Y-1)

Increment X by 1

(X+1,Y)

Increment Y by 1

(X,Y+1)

Decrement X by 1 and increment Y by 1

(X-1,Y+1)

Decrement X and Y by 1

(X-1,Y-1)

Decrement Y by 1

(X,Y-1)

Decrement X by 1

(X-1,Y)

Table-1: Criteria to establish neighborhood for a data

Point having coordinates as (X, Y).

The neighbors of a data point consists of all the data points that fall within the specified Eps distance value in the above mentioned directions. For a discrete dataset that constitutes an XY-plane, our algorithm can possibly check up to eight data points in the neighbor of a current data point for a unit value of Eps. Table-1 illustrates the criteria used to establish the neighborhood of a data point based on x- and y-coordinates.

The neighborhood criteria mentioned in the above table can graphically be depicted as shown in Figure. 2. In the neighborhood checking process, all those data points that are at the Eps distance and pass the MinPts criteria are merged into the cluster and are assigned the corresponding ClusterID. If the data points in the inspection of region query do not satisfy the MinPts criteria, then the seedSkipCount variable is incremented.

Figure2: Neighborhood Criteria.

When the value of seedSkipCount variable becomes two or greater than two, then a new cluster ID is generated and next available data point in the dataset with label NOISE is chosen and region query check is carried out to verify the existence of another possible cluster. This whole process is iteratively carried out until all the data points in the dataset are checked.

## Merging

In our proposed IncDBSCAN algorithm, merging of clusters is the final stage. In the merging process, two or more adjoining clusters with the identical characteristics are merged together to form a bigger cluster. The logic for merging clusters is that there is a possibility of formation of new small clusters due to the MinPts criteria. The MinPts criteria require the existence of some specific number of data points in the close proximity of a core data point. In certain cases, there is a possibility that some data points may exist in the neighborhood of a core point but their number is less than the required MinPts. In the incidence of such a situation, the algorithm will form a new cluster by skipping the seed points. This scenario leads to a new notion, although the neighbors of a core point are less than the required MinPts but the neighbors of any of region query point may satisfy MinPts criteria. In such a situation, the adjacent clusters need to be merged. An abridged version of our IncDBSCAN algorithm is provided in the Figure 3 below.

//Objects is in an Multidimensional Array of data points

## function IncDBSCAN(setOfObjects, Eps,MinPts)

//call the sort method to sort the dataset points

Objects = Sort_Objects(setOfObjects)

//Initialize each point with a NOISE flag

setOfObjects.flag = NOISE;

//set the first clusterId to 1

ClusterID = 1;

//Special variable to find which co-ordinates to skip for //checking

SeedSkipConut = 0;

//loop through the dataset points

## for loopVar = 1 to setOfObjects.Size

//get the co-ordinates of the next point in the dataset

x = setOfObjects.getNextPoint(loopVar);

//verify that the point has not been clustered before

if setOfObjects(loopVar).flag = NOISE then

seed = RegionQuery (X,Y,Eps);

//check the results returned from the RegionQuery method //for labeling the points with the current clusterId

if seed.size >= MinPts then

for each seed Do

setOfObjects(seed).flag = ClusterID;

End For Each

//assign the clusterId to the current point in the dataset at //loopvar location

setOfObjects(loopVar).flag = ClusterID;

else

SeedSkipCount ++;

If SeedSkipCount >= 2

ClusterID ++;

SeedSkipCount = 0;

End If

End If

End If

//loop ends here

## End for

End;

//call to merge method to merge the clusters with //matching characteristics

Merge_Cluster();

// end of main function

End;

Figure 3: Pseudo code of IncDBSCAN algorithm.

COMPUTATIONAL COMPLEXITY

The execution or computational complexity of IncDBSCAN mainly depends on the number of iterations carried out during the execution of its core loop and calling of the function RegionQuery(). The core loop is only executed once and is done in a linear time. Within the core loop, the RegionQuery() is called maximum of eight times to check the neighborhood of a point so its execution time can be consider as constant. Hence, the computational complexity of IncDBSCAN algorithm is linear i.e. O(n). However, our proposed algorithm requires sorting the dataset first. If sorting of dataset is carried out by using merge sort whose computational complexity is O(n log(n)), then the overall execution complexity of our proposed algorithm will be O(n) + O(n log(n)). The simplification of this computational complexity reduces to O(n log(n)), hence, the overall computational complexity of IncDBSCAN is O(n log(n)). Moreover, the experimental results described in Section VI show that the computational efficiency of IncDBSCAN is much better than the fundamental DBSCAN algorithm.

EXPERIMENTAL DETAILS

Details of the experiment and evaluation of the performance of IncDBSCAN algorithm are discussed in this section. A two dimensional dataset used in [19] is taken as test data to evaluate the performance of our proposed algorithm. We implemented our proposed algorithm by writing m-file coding in Matlab version 7.7.

Figure 4: Plain drawing of 1000 data objects.

The experiment was carried out on a Pentium system having 2GB RAM and 3.0 GHz processor. The test dataset is a two dimensional data consisting of 1000 objects. The dataset was first sorted in ascending order with respect to the x-coordinate values. The sort dataset was plotted in Matlab as shown in Figure 4.

Figure 5: Marking of clusters in the dataset.

CONCLUSION AND FUTURE WORK

In this paper, we have presented a new density-based clustering algorithm named as IncDBSCAN. This algorithm is an attempt to improve the clustering results of the basic DBSCAN clustering algorithm. To enhance the efficiency of cluster identification process, we perform region query for each data point that is marked as a member of the cluster. At the end, clusters are merged to obtain the final clustering results. The experimental results demonstrate that clustering technique of IncDBSCAN is promising as it not only identifies the correct number of clusters in a dataset but also its computational complexity is lesser than the fundamental DBSCAN algorithm. There could be further improvement to our proposed algorithm which will be part of the future research direction. Some of the future prospective include: analyzing the clustering results by using dataset of different sizes with variable number of MinPts, verifying its efficiency using the disparate nature and dimensions of datasets. Moreover, modifying IncDBSCAN to handle continuous datasets is also a potential future work.