Performance Analysis Of Cluster Evaluation Computer Science Essay

We present an indexing technique for real high-dimensional data sets, focusing on the clustering paradigm for search and retrieval. A major problem is that pruning irrelevant clusters with bounding hyperspheres and bounding rectangles lacks efficiency in exact nearest neighbor search. This is overcome by a cluster-adaptive distance bound based on the separating hyperplane boundaries of Voronoi clusters, which complements a cluster-based index. The bound enables efficient spatial filtering with a comparatively small preprocessing storage overhead. To meet these challenges, a high-dimensional indexing scheme is introduced that is scalable with data set size and data dimensionality.

Various methods have been proposed for high-dimensional indexing and cluster distance bounding in multimedia databases. In this paper we analyze current clustering methods and provide an overview of the emerging indexing scheme based on adaptive cluster distance bounding, together with related research in this area. Comparisons are also made between the various schemes to explain their advantages and limitations. The experimental evaluation presents a performance analysis of adaptive cluster distance bounding for high-dimensional indexing in multimedia databases on the basis of tuning time, energy usage, accuracy, and execution time.

Keywords: Similarity Search, High-Dimensional Data Set, Cluster-Adaptive Scheme, Indexing Technique, Distance Bound, Curse of Dimensionality.

1 INTRODUCTION

Clustering is a significant and computationally expensive task in data mining. For high-dimensional data, traditional clustering techniques suffer from the problem of determining meaningful clusters due to the curse of dimensionality. A common approach to cope with the curse of dimensionality in mining tasks is to reduce the data dimensionality, for example with techniques such as Vector Approximation (VA).

A popular and effective technique to combat the curse of dimensionality is the Vector Approximation file (VA-File). The VA-File partitions the space into hyper-rectangular cells to obtain a quantized approximation of the data that lie inside the cells. Nonempty cell locations are encoded into bit strings and stored in a separate approximation file on the hard disk. During a nearest neighbor search, the vector approximation file is scanned sequentially, and upper and lower bounds on the distance from the query vector to each cell are estimated. The bounds are used to prune irrelevant cells. The final set of candidate vectors is then read from the hard disk and the exact nearest neighbors are determined.
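As a concrete illustration of this scan-and-prune pipeline, the following sketch builds a toy VA-File with uniform scalar quantization and answers a nearest neighbor query by pruning every cell whose lower bound exceeds the smallest upper bound. The function names, the number of bits per dimension, and the uniform cell spacing are illustrative choices, not the exact design of the original VA-File.

```python
import numpy as np

def build_va_file(data, bits_per_dim=4):
    """Quantize each dimension into 2^bits uniform cells; return cell indices and grid edges."""
    n_cells = 2 ** bits_per_dim
    lo, hi = data.min(axis=0), data.max(axis=0)
    edges = [np.linspace(l, h, n_cells + 1) for l, h in zip(lo, hi)]
    # Cell index of each vector along each dimension (clip the maximum into the last cell).
    approx = np.stack([
        np.clip(np.searchsorted(e, data[:, d], side="right") - 1, 0, n_cells - 1)
        for d, e in enumerate(edges)
    ], axis=1)
    return approx, edges

def cell_bounds(query, approx, edges):
    """Lower/upper bounds on the Euclidean distance from the query to each vector's cell."""
    lower = np.zeros(len(approx))
    upper = np.zeros(len(approx))
    for d, e in enumerate(edges):
        c_lo = e[approx[:, d]]        # cell lower edge in dimension d
        c_hi = e[approx[:, d] + 1]    # cell upper edge in dimension d
        q = query[d]
        # Lower bound: zero if q falls inside the slab, else distance to the nearer edge.
        lb = np.where(q < c_lo, c_lo - q, np.where(q > c_hi, q - c_hi, 0.0))
        # Upper bound: distance to the farther edge.
        ub = np.maximum(np.abs(q - c_lo), np.abs(q - c_hi))
        lower += lb ** 2
        upper += ub ** 2
    return np.sqrt(lower), np.sqrt(upper)

def va_nearest_neighbor(query, data, approx, edges):
    """Phase 1: prune cells whose lower bound exceeds the best upper bound.
       Phase 2: compute exact distances only for the surviving candidates."""
    lower, upper = cell_bounds(query, approx, edges)
    threshold = upper.min()
    candidates = np.where(lower <= threshold)[0]
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return int(candidates[np.argmin(dists)])
```

The pruning is safe because the true nearest neighbor's lower bound can never exceed the minimum upper bound, so it always survives to the exact-distance phase.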

The VA-File was followed by several more recent techniques to combat the curse of dimensionality. In the VA+-File, the data set is rotated into a set of uncorrelated dimensions, with more approximation bits being allotted to dimensions with higher variance. The approximation cells are adaptively spaced according to the data distribution. Crucial to the efficiency of the clustering-based search strategy is tight bounding of query-cluster distances, since this is the mechanism that allows the removal of irrelevant clusters. Conventionally, this is performed with bounding spheres and rectangles. However, hyperspheres and hyperrectangles are generally not optimal bounding surfaces for clusters in high-dimensional spaces.

The principle is that, at high dimensions, substantial improvement in efficiency can be achieved by relaxing restrictions on the regularity of the bounding surfaces (i.e., spheres or rectangles). Specifically, creating Voronoi clusters, with piecewise-linear boundaries, allows more general convex polytope structures that are able to bound the cluster surface tightly. This is possible with the construction of Voronoi clusters under the Euclidean distance measure. Projecting the query onto these hyperplane boundaries and complementing with the cluster-hyperplane distance yields an appropriate lower bound on the distance of a query to a cluster.
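A minimal sketch of this hyperplane-based lower bound, assuming Voronoi clusters built around centroids under the Euclidean metric: when a query lies on the far side of the perpendicular bisector between centroids i and j, the segment from the query to any point of cluster i must cross that hyperplane, so the distance is at least the query-hyperplane distance plus the precomputed cluster-to-hyperplane distance. All function names here are illustrative.

```python
import numpy as np

def signed_dist(x, ci, cj):
    """Signed distance of x to the bisecting hyperplane of centroids ci, cj
       (positive on cj's side; points of Voronoi cluster i have non-positive values)."""
    n = cj - ci
    b = (cj @ cj - ci @ ci) / 2.0
    return (x @ n - b) / np.linalg.norm(n)

def precompute_hyperplane_margins(data, labels, centroids):
    """For each cluster i and each other centroid j, the minimum distance from
       cluster i's points to the (i, j) bisecting hyperplane (done once, offline)."""
    k = len(centroids)
    margins = np.zeros((k, k))
    for i in range(k):
        pts = data[labels == i]
        for j in range(k):
            if i != j:
                margins[i, j] = np.abs(signed_dist(pts, centroids[i], centroids[j])).min()
    return margins

def cluster_lower_bound(q, i, centroids, margins):
    """Lower bound on d(q, cluster i): for every hyperplane H_ij that separates q
       from cluster i, d(q, x) >= d(q, H_ij) + d(x, H_ij) for all x in the cluster."""
    bound = 0.0
    for j in range(len(centroids)):
        if j == i:
            continue
        s = signed_dist(q, centroids[i], centroids[j])
        if s > 0:  # q is on cluster j's side, so H_ij separates q from cluster i
            bound = max(bound, s + margins[i, j])
    return bound
```

At query time, clusters whose lower bound already exceeds the best distance found so far can be skipped without reading their vectors from disk.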

High-dimensional indexing methods are based on the principle of hierarchical clustering of the data space. The data vectors are stored in data nodes such that spatially adjacent vectors are likely to reside in the same node. Each data vector is stored in exactly one data node; i.e., there is no object duplication among data nodes. The data nodes are organized in a hierarchically structured directory, and each directory node points to a set of subtrees. Usually, the structure of the information stored in data nodes is entirely different from that of the directory nodes. In contrast, the directory nodes are uniformly structured across all levels of the index and consist of (key, pointer) tuples.

The key information differs between index structures; it serves as an entry point for query and update processing. The index structures are height-balanced: the lengths of the paths between the root and all data pages are the same, but may change after insert or delete operations. The length of a path from the root to a data page is called the height of the index structure. The length of the path from an arbitrary node to a data page is called the level of the node. Data pages are on level zero.
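The node layout described above might be sketched as follows; the class and field names are illustrative, and a real index's directory entries would carry structure-specific key information (e.g., bounding regions) rather than opaque keys.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class DataNode:
    """Leaf: stores the actual data vectors; data pages are always on level 0."""
    vectors: List[Tuple[float, ...]]
    level: int = 0

@dataclass
class DirectoryNode:
    """Internal node: uniformly structured (key, pointer) tuples at every level."""
    entries: List[Tuple[object, "Node"]] = field(default_factory=list)

    @property
    def level(self) -> int:
        # Height-balanced: every child subtree sits at the same level,
        # so inspecting the first child is sufficient.
        return self.entries[0][1].level + 1

Node = Union[DataNode, DirectoryNode]
```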

2 LITERATURE REVIEW

A new, well-organized high-dimensional indexing scheme satisfies the requirements of content-based retrieval over large amounts of video data. In addition, [1, 2] supply the insertion algorithm and a k-NN search algorithm for this high-dimensional indexing scheme, which achieves better retrieval performance. [5] proposed a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test; words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation.

[3] proposed a new cluster-adaptive distance bound based on the separating hyperplane boundaries of Voronoi clusters to complement a cluster-based index. This bound enables efficient spatial filtering, with a comparatively small preprocessing storage overhead, and is applicable to Euclidean similarity measures. [4] proposed a pruning scheme based on Voronoi diagrams to reduce the number of expected distance calculations. These techniques are formally shown to be more effective than the basic bounding-box-based technique. An R-tree index is then introduced to organize the uncertain objects so as to reduce pruning overheads.

Subspace clustering is an emerging task [12] which aims at discovering clusters embedded in subspaces. [11] proposed a novel subspace clustering model to discover clusters based on relative region densities in the subspaces, where clusters are regarded as regions whose densities are relatively high compared to the region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace cardinalities.

Building on this model of clusters as regions of relatively high density, [7] devised an innovative algorithm, DENCOS (DENsity COnscious Subspace clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities.

[8] proposed a new method for achieving k-anonymity, named k-Anonymity of Classification Trees Using Suppression (kACTUS). In kACTUS, efficient multidimensional suppression is performed: the method identifies the attributes that have less influence on the classification of the data records and suppresses them if needed in order to comply with k-anonymity. The kACTUS method was evaluated on 10 separate data sets to assess its accuracy compared to other k-anonymity generalization- and suppression-based methods.
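To make the suppression idea concrete, here is a heavily simplified sketch: it checks k-anonymity over a set of quasi-identifiers and greedily suppresses attributes until the property holds. The attribute-importance heuristic used here (drop the highest-cardinality attribute) is a placeholder; kACTUS itself derives attribute relevance from a classification tree.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """A table is k-anonymous when every combination of quasi-identifier
       values is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

def suppress_until_k_anonymous(records, quasi_ids, k):
    """Greedy suppression in the spirit of kACTUS: repeatedly drop the
       quasi-identifier assumed to matter least until k-anonymity holds.
       Returns the quasi-identifiers that were kept."""
    qids = list(quasi_ids)
    while qids and not is_k_anonymous(records, qids, k):
        # Illustrative heuristic: suppress the attribute with the most distinct values.
        qids.remove(max(qids, key=lambda a: len({r[a] for r in records})))
    return qids
```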

For temporal data clustering, [6, 10] proposed a novel weighted consensus function guided by clustering validation criteria to reconcile initial partitions into candidate consensus partitions from different perspectives, and introduced an agreement function to further reconcile those candidate consensus partitions into a final partition. As a result, the proposed weighted clustering ensemble algorithm supplies an effective enabling technique for the joint use of different representations, which cuts the information loss of a single representation and exploits the various information sources underlying temporal data.

The weighted clustering ensemble algorithm [9] thus provides a useful enabling technique for the joint use of multiple representations, which cuts the information loss of a single representation and exploits the various information sources underlying temporal data. In addition, the approach tends to capture the intrinsic structure of a data set, e.g., the number of clusters.

3 METHODOLOGIES

The different tasks involved in the "Performance Analysis of Adaptive Cluster Distance Bounding" are:

3.1 High Dimensional Indexing of Clustering Framework

Similarity search over high-dimensional data sets is treated within a clustering framework. Indexing by "vector approximation" (VA-File) was proposed as a technique to combat the "curse of dimensionality"; it utilizes scalar quantization and hence inevitably ignores dependencies across dimensions, which is a source of suboptimality. Clustering, on the other hand, exploits interdimensional correlations and is a more compact representation of the data set.

However, existing methods to prune irrelevant clusters are based on bounding hyperspheres and bounding rectangles, whose lack of tightness compromises their effectiveness in exact nearest neighbor search. A new Adaptive Cluster Distance Bound (ACDB) is proposed, based on the separating hyperplane boundaries of Voronoi clusters, to complement the cluster-based index. This bound enables efficient spatial filtering, with a comparatively small preprocessing storage overhead, and is applicable to Euclidean and Mahalanobis similarity measures. Experiments in exact nearest-neighbor set retrieval, conducted on real data sets, show that the indexing method is scalable with data set size and data dimensionality and outperforms several recently proposed indexes. Relative to the VA-File, over a wide range of quantization resolutions, it is able to reduce random IO accesses given the same amount of sequential IO operations.

3.2 Clustering Uncertain Data using Voronoi Diagrams and R-Tree Index

The problem of clustering uncertain objects whose locations are described by probability density functions (PDFs) is described here. The UK-means algorithm, which generalizes the k-means algorithm to handle uncertain objects, is very inefficient. The inefficiency comes from the fact that UK-means computes expected distances (EDs) between objects and cluster representatives. For arbitrary PDFs, expected distances are computed by numerical integration, which is an expensive operation. A pruning technique based on Voronoi diagrams is proposed to decrease the number of expected distance calculations.

These techniques are formally proven to be more effective than the basic bounding-box-based technique previously known in the literature. Experiments show that the techniques are additive and considerably outperform previously known methods.
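The Voronoi-diagram pruning idea can be sketched as follows, assuming each uncertain object is represented by PDF samples and an axis-aligned bounding box. If the box lies entirely inside the Voronoi cell of one representative, that representative is nearest for every possible object location, and expected distances to all other representatives need not be computed. The function names and the Monte Carlo ED estimate are illustrative stand-ins for the paper's numerical integration.

```python
import numpy as np
from itertools import product

def expected_distance(samples, rep):
    """Monte Carlo estimate of E[d(X, rep)] for an uncertain object whose PDF
       is represented by samples -- the expensive operation UK-means repeats."""
    return float(np.linalg.norm(samples - rep, axis=1).mean())

def box_inside_voronoi_cell(bbox_lo, bbox_hi, reps, i):
    """True if the object's bounding box lies entirely inside the Voronoi cell
       of representative i. The cell is convex, so it suffices to check that
       every box corner is on rep i's side of each bisecting hyperplane."""
    corners = np.array(list(product(*zip(bbox_lo, bbox_hi))))
    for j, rj in enumerate(reps):
        if j == i:
            continue
        n = rj - reps[i]
        b = (rj @ rj - reps[i] @ reps[i]) / 2.0
        if np.any(corners @ n > b):  # some corner is closer to rep j
            return False
    return True
```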

3.3 Density Conscious Subspace Clustering for High-Dimensional Data

Instead of finding clusters in the full feature space, subspace clustering is a developing task which aims at detecting clusters embedded in subspaces. Most previous works in the literature are density-based approaches, where a cluster is regarded as a high-density region in a subspace. However, the identification of dense regions in previous works overlooks a critical problem, called "the density divergence problem." Without considering this problem, previous works use a single density threshold to discover the dense regions in all subspaces, which incurs a serious loss of clustering accuracy across different subspace cardinalities.

To tackle the density divergence problem, a novel subspace clustering model is devised to discover clusters based on the relative region densities in the subspaces, where clusters are regarded as regions whose densities are relatively high compared to the region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace cardinalities. Because previous techniques are infeasible under this novel clustering model, an innovative algorithm, DENCOS (DENsity COnscious Subspace clustering), is devised; it adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities. As validated by extensive experiments on various data sets, DENCOS discovers the clusters in all subspaces with high quality and outperforms previous methods in efficiency.
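A toy version of a cardinality-adaptive density threshold, for intuition only: rather than one global count cutoff, a grid cell in a subspace is called dense when its count stands out relative to the average cell count of that subspace, which naturally shrinks as more dimensions are added. The binning scheme, the relative factor alpha, and the assumption of data in [0, 1) are all illustrative; DENCOS's actual thresholds and search procedure are more elaborate.

```python
import numpy as np
from collections import Counter

def dense_units(data, dims, n_bins=5, alpha=2.0):
    """Flag grid cells in the subspace `dims` whose point count is at least
       `alpha` times the average cell count in that subspace -- a relative,
       cardinality-adaptive threshold rather than one global density cutoff."""
    # Discretize the chosen dimensions into equal-width bins over [0, 1).
    cells = Counter()
    for x in data:
        key = tuple(min(int(x[d] * n_bins), n_bins - 1) for d in dims)
        cells[key] += 1
    # More dimensions -> exponentially more cells -> lower average count,
    # so the absolute threshold adapts to the subspace cardinality.
    avg = len(data) / (n_bins ** len(dims))
    return {cell: c for cell, c in cells.items() if c >= alpha * avg}
```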

3.4 Temporal Data Clustering with Different Representations

Temporal data clustering offers underpinning techniques for discovering the intrinsic structure of, and condensing the information in, temporal data. A novel weighted consensus function guided by clustering validation criteria is proposed to reconcile initial partitions into candidate consensus partitions from different perspectives, and an agreement function is introduced to further reconcile those candidate consensus partitions into a final partition.

As a result, the proposed weighted clustering ensemble algorithm supplies an effective enabling technique for the joint use of different representations, which cuts the information loss of a single representation and exploits the various information sources underlying temporal data. Simulation results reveal that the approach yields favorable results for a diversity of temporal data clustering tasks, since the weighted clustering ensemble algorithm can combine any input partitions to generate a clustering ensemble.
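One simple stand-in for such a consensus scheme is a weighted co-association matrix followed by connected-component extraction. The cited algorithm's validation-criteria-driven weights and agreement function are replaced here by user-supplied (or uniform) weights and a fixed threshold, so this is a sketch of the general idea rather than the published method.

```python
import numpy as np

def coassociation_matrix(partitions, weights=None):
    """Weighted co-association matrix: entry (a, b) is the weighted fraction of
       input partitions that place objects a and b in the same cluster."""
    n = len(partitions[0])
    w = np.ones(len(partitions)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    M = np.zeros((n, n))
    for wt, p in zip(w, partitions):
        p = np.asarray(p)
        M += wt * (p[:, None] == p[None, :])
    return M

def consensus_partition(partitions, threshold=0.5, weights=None):
    """Final partition: connected components of the graph linking objects that
       are co-clustered in more than `threshold` (weighted) of the partitions."""
    M = coassociation_matrix(partitions, weights)
    n = len(M)
    labels = [-1] * n
    cur = 0
    for s in range(n):
        if labels[s] != -1:
            continue
        stack, labels[s] = [s], cur
        while stack:  # flood-fill one connected component
            a = stack.pop()
            for b in range(n):
                if labels[b] == -1 and M[a, b] > threshold:
                    labels[b] = cur
                    stack.append(b)
        cur += 1
    return labels
```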

3.5 Clustering Algorithm for Text Classification

Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. A fuzzy similarity-based self-constructing algorithm for feature clustering is proposed. The words in the feature vector of a document set are grouped into clusters based on a similarity test; words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster.

With this algorithm, the resulting membership functions match closely with, and properly describe, the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, so trial-and-error for determining the appropriate number of extracted features is avoided. Experimental results show that the method runs faster and obtains better extracted features.
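The self-constructing loop can be sketched as below, under simplifying assumptions: Gaussian membership functions, a user-set membership threshold rho, and an initial deviation sigma0 (both illustrative parameters). A word pattern joins the best-matching cluster when its membership clears the threshold and otherwise seeds a new cluster, so the number of clusters emerges automatically rather than being fixed in advance.

```python
import numpy as np

def membership(x, mean, sigma):
    """Gaussian membership of word pattern x in a cluster (product over dimensions)."""
    return float(np.prod(np.exp(-((x - mean) ** 2) / (sigma ** 2))))

def self_constructing_cluster(patterns, rho=0.5, sigma0=0.25):
    """Incremental feature clustering: each pattern either joins the cluster
       with the highest membership (if it exceeds rho) or starts a new one.
       Cluster means and deviations are updated as members arrive."""
    clusters = []
    for x in patterns:
        if clusters:
            scores = [membership(x, c["mean"], c["sigma"]) for c in clusters]
            best = int(np.argmax(scores))
        if not clusters or scores[best] < rho:
            clusters.append({"members": [x], "mean": x.astype(float),
                             "sigma": np.full_like(x, sigma0, dtype=float)})
        else:
            c = clusters[best]
            c["members"].append(x)
            m = np.array(c["members"])
            c["mean"] = m.mean(axis=0)
            # Keep a floor on sigma so near-singleton clusters stay usable.
            c["sigma"] = np.maximum(m.std(axis=0), sigma0)
    return clusters
```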

4 PERFORMANCE RESULT

This section presents a performance analysis of the various schemes through experiments on multimedia databases, examining the high-dimensional indexing scheme and cluster-distance-based metrics. Performance is measured in terms of:

Accuracy

Execution time

Performance of Indexes

Distance Ratio

4.1 Accuracy

Accuracy (%):

No. of Extracted Features | ACDB method for Indexing | FSF Clustering Algorithm | Pruning technique | DENCOS Method
50  | 0.83 | 0.70 | 0.51 | 0.62
100 | 0.88 | 0.72 | 0.49 | 0.59
150 | 0.92 | 0.69 | 0.46 | 0.61
200 | 0.96 | 0.75 | 0.53 | 0.65
250 | 0.99 | 0.78 | 0.59 | 0.67

Table 4.1 No. of Extracted Features Vs Accuracy

Fig 4.1 No. of Extracted Features Vs Accuracy

Fig. 4.1 plots the number of extracted features against accuracy for the different indexing schemes. The result shows that as the number of extracted features increases, accuracy also increases markedly. Accuracy is measured in percentage (%). Accuracy is higher for the ACDB indexing method than for the FSF Clustering Algorithm, the pruning technique, and the DENCOS method in the multimedia database. In this experiment, the ACDB indexing method produces better results than the other schemes.

4.2 Execution Time

Execution Time (sec):

Size of Database | ACDB method for Indexing | Pruning technique | Weighted Clustering Algorithm | DENCOS Method
50 K  | 2  | 8  | 12 | 11
75 K  | 4  | 14 | 26 | 27
100 K | 8  | 20 | 42 | 51
125 K | 13 | 32 | 55 | 64
150 K | 17 | 45 | 67 | 79

Table 4.2 Size of Database Vs Execution Time

Fig 4.2 Size of Database Vs Execution Time

Fig 4.2 shows the execution time of the various indexing schemes. In particular, the analysis relies greatly on the ACDB indexing process. To check the performance of information retrieval from the database, a test was set up which measures the execution time of the Adaptive Cluster Distance Bound (ACDB) scheme on databases ranging in size from 50 K to 150 K. From Figure 4.2 it can be seen that the ACDB indexing scheme executes faster than the other existing systems.

4.3 Performance of Indexes

Number of Sequential Pages:

No. of Random IO (log scale) | ACDB method for Indexing | DENCOS Method | FSF Clustering Algorithm | Pruning technique
10     | 2300 | 2060 | 1840 | 1140
100    | 2160 | 1830 | 1390 | 980
1000   | 2091 | 1575 | 1250 | 735
10000  | 2255 | 1684 | 950  | 850
100000 | 2435 | 1350 | 900  | 544

Table 4.3 No. of Random IO Vs Sequential Pages

The above table (Table 4.3) describes the performance of the indexing scheme compared with the various existing systems; plotting the number of random IOs against the sequential page count yields the indexing performance.

Fig 4.3 No. of Random IO Vs Sequential Pages

Fig 4.3 describes the indexing behavior while processing queries. In the ACDB scheme, the indexing variance is 20-25% higher compared to the other indexing patterns.

4.4 Distance Ratio

Query-Cluster distance (%):

Cluster ID | ACDB method for Indexing | FSF Clustering Algorithm | Pruning technique | Weighted Clustering Algorithm
5  | 85 | 70 | 58 | 80
10 | 87 | 72 | 63 | 91
15 | 81 | 68 | 71 | 95
20 | 95 | 82 | 75 | 87
25 | 97 | 79 | 72 | 97

Table 4.4 Cluster ID Vs Query-Cluster Distance

Fig 4.4 Cluster ID Vs Query-Cluster Distance

The distance ratio is the ratio of the cluster ID to the query-cluster distance. Fig 4.4 shows that the Weighted Clustering Algorithm achieves high values for the query-cluster distance (92%) and remains significantly more efficient than the other three indexing schemes. This comparison shows that Weighted Clustering gives better results than the ACDB method for indexing (89%), the FSF Clustering Algorithm (75%), and the pruning technique (68%).

CONCLUSION

This paper discussed various methods for high-dimensional indexing and cluster distance bounding in multimedia databases. Comparisons were made to explain the advantages and limitations of the different indexing schemes, and their performance was evaluated through experiments. The experimental results demonstrate that some of the schemes support high-dimensional indexing and some support cluster distance bounding. The schemes were examined on four criteria: accuracy, execution time, indexing performance, and distance ratio. From the experimental results, Adaptive Cluster Distance Bounding (ACDB) for high-dimensional indexing in multimedia databases performs well on three criteria (accuracy, execution time, and indexing performance) compared with the pruning technique, the Weighted Clustering Algorithm, the DENCOS method, and the FSF Clustering Algorithm.
