# Enhanced VAT for Cluster Quality Assessment


The growing demand for clustering unlabeled data into groups of similar objects makes determining the number of clusters a central problem. In addition, the quality of the resulting clusters should be analyzed to ensure that objects are grouped accurately. Most available clustering algorithms require the number of clusters C, often together with a threshold, to be supplied in advance. The method proposed in this paper, Enhanced VAT, automatically identifies the number of object groups or clusters in unlabeled data sets. The algorithm builds on Visual Assessment of cluster Tendency (VAT), combined with Euclidean and Mahalanobis distance measures and common image processing techniques. Enhanced VAT produces a binary image, which can be visually assessed for cluster tendency. Plain VAT, however, becomes impractical for very large data sets. Enhanced VAT reduces the amount of computation and computes dissimilarities with several distance measures to support an effective visual evaluation. The algorithm is validated on several UCI data sets and a real-world HIV data set.

## Key words

Clustering, Enhanced VAT, Visual assessment, cluster tendency, automatic clustering, and UCI data sets.

## Introduction

A major task in data mining is to organize observed data into knowledge structures. Clustering aims to group objects of a similar kind into their respective categories. Partitioning a set of objects $O = \{o_1, o_2, \ldots, o_n\}$ into $C$ self-similar groups is the central problem of cluster analysis. All clustering algorithms will find an arbitrary number ($1 \le C \le n$) of clusters, even if no "actual" clusters exist. Therefore, a fundamental question to ask before applying any particular clustering algorithm is whether clusters are present at all, and how the cluster tendency can be assessed. Hathaway et al. [1] proposed a concrete answer to this question: assess clustering tendency as a step before clustering.

Various statistically based and informal techniques for tendency assessment are discussed in Jain and Dubes [2] and Everitt [3]. None of the existing approaches is completely satisfactory (nor will they ever be). The purpose of this note is to add a simple and intuitive visual approach to the existing repertoire of tendency assessment tools.

Visual approaches to various data analysis problems have been widely studied in the last 25 years; Tukey [4] and Cleveland [5] are standard sources for many visual techniques. The visual approach for assessing cluster tendency introduced here can be used in all cases involving numerical data. It is both convenient and expected that new methods in clustering have a catchy acronym. Consequently, we call this tool VAT (visual assessment of tendency). In VAT, the pairwise dissimilarity information about the set of objects $O = \{o_1, o_2, \ldots, o_n\}$ is represented as a square digital image with $n^2$ pixels, after the objects have been suitably reordered so that the image emphasizes potential cluster structure. When each object in $O$ is represented by a column vector $x$ in $\mathbb{R}^s$, the set $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^s$ is called an object data representation of $O$. The $m$th component of the $i$th feature vector ($x_{mi}$) is the value of the $m$th feature of the $i$th object. Alternatively, the objects in $O$ can be represented by relational data, written $R = [R_{ij}]$, where $R_{ij}$ is the pairwise dissimilarity (usually a distance) between objects $o_i$ and $o_j$, for $1 \le i, j \le n$. More generally, $R$ can be a matrix of similarities based on a variety of measures [6,7]. If the original data consist of a matrix of pairwise similarities $S = [S_{ij}]$, then dissimilarities can be obtained through several simple transformations. For example, we can take

$$R_{ij} = S_{\max} - S_{ij}, \tag{1a}$$

where $S_{\max}$ denotes the largest similarity value. If the original data set consists of object data $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^s$, then $R_{ij}$ can be computed as $R_{ij} = \lVert x_i - x_j \rVert$, using any convenient norm on $\mathbb{R}^s$. If the original data have missing components, suitable preprocessing is applied first. The purpose of imputing data here is simply to get a very rough picture of the cluster tendency in $O$. Consequently, sophisticated imputation schemes, such as those based on the expectation-maximization (EM) algorithm of Dempster, Laird and Rubin [8], are unnecessarily expensive in both complexity and computation time. For incomplete object data, we suggest the Dixon [9] scheme, which generates a pairwise Euclidean (or other norm) dissimilarity $R_{ij}$ from incomplete $x_i$ and $x_j$ simply by using all features common to both objects and then scaling the result according to how many of the $s$ possible features are actually used. For missing dissimilarity values ($R_{ij}$), one of the triangle inequality schemes in Hathaway and Bezdek [10] should be sufficiently accurate. We refer the reader interested in learning more about missing data and imputation to Little and Rubin [11] and Schafer [12]. We can therefore assume without loss of generality that dissimilarity data of the type needed for a VAT display can be easily obtained, whether the original data description of $O$ is object or relational, and whether the data are complete or incomplete.
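The two conversions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the scaling factor `s / (number of shared features)` applied to the squared distance is one common reading of "properly scaling the result".

```python
import math

def similarity_to_dissimilarity(S):
    """Equation (1a): R_ij = S_max - S_ij."""
    s_max = max(max(row) for row in S)
    return [[s_max - v for v in row] for row in S]

def partial_distance(x, y):
    """Euclidean-style dissimilarity between two feature vectors that may
    contain missing values (None), using only features present in both."""
    s = len(x)
    common = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not common:
        return None  # no shared feature: the dissimilarity itself is missing
    squared = sum((a - b) ** 2 for a, b in common)
    # Assumed scaling: s / len(common), so that using fewer features does
    # not systematically shrink the distance.
    return math.sqrt(squared * s / len(common))

def dissimilarity_matrix(X):
    """Pairwise dissimilarities R for a list of (possibly incomplete) vectors."""
    n = len(X)
    return [[partial_distance(X[i], X[j]) for j in range(n)] for i in range(n)]

# Three objects with s = 3 features; two of them are incomplete.
X = [[1.0, 2.0, None], [1.0, None, 3.0], [4.0, 6.0, 3.0]]
R = dissimilarity_matrix(X)
```

Entries that come back as `None` correspond to missing dissimilarities, which would then need one of the relational imputation schemes mentioned above.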

Key papers in the visual representation of data dissimilarity include the works of Sneath [13], Floodgate and Hayes [14], Ling [15], and Bezdek and Hathaway [16]. The common denominator in all of these methods is the "reordered dissimilarity image" (RDI). The intensity of each pixel in the RDI corresponds to the dissimilarity between the pair of objects addressed by the row and column of the pixel. A "useful" RDI highlights potential clusters as a set of "dark blocks" along the diagonal of the image, corresponding to sets of objects with low dissimilarity. An observer can then estimate the number of clusters C simply by counting the dark blocks along the diagonal of an RDI, provided these blocks are visually distinct (see Fig. 1.1c).

## Fig. 1.1 An example of a VAT image. (a) Scatter plot of a 3,000-point data set with five clusters. (b) Unordered image. (c) Reordered VAT image.

Therefore, the VAT approach is applicable to virtually all numerical data sets. In the next section we describe the main idea of the VAT approach. Section 3 discusses relatives of VAT and similarity measures. Section 4 explains the Enhanced VAT algorithm and its implementation. Section 5 gives a series of examples using various real and synthetic data sets that illustrate various facets of the Enhanced VAT tool. The final section contains some concluding remarks and topics for further research.

## 2. Ordered Dissimilarity Images

Let $R$ be an $n \times n$ dissimilarity matrix corresponding to the set $O = \{o_1, o_2, \ldots, o_n\}$. We assume that $R$ satisfies the following conditions for all $1 \le i, j \le n$:

$$R_{ij} \ge 0 \tag{2a}$$

$$R_{ij} = R_{ji} \tag{2b}$$

$$R_{ii} = 0 \tag{2c}$$

We display $R$ as an intensity image $I$, which we call a dissimilarity image. The intensity or gray level $g_{ij}$ of pixel $(i, j)$ depends on the value of $R_{ij}$. The value $R_{ij} = 0$ corresponds to $g_{ij} = 0$ (pure black); the value $R_{ij} = R_{\max}$, where $R_{\max}$ denotes the largest dissimilarity value in $R$, gives $g_{ij} = R_{\max}$ (pure white). Intermediate values of $R_{ij}$ produce pixels with intermediate levels of gray in a set of gray levels $G = \{G_1, \ldots, G_m\}$. The images shown below use 256 equally spaced gray levels, with $G_1 = 0$ (black) and $G_m = R_{\max}$ (white). The displayed gray level of pixel $(i, j)$ is the level $g_{ij} \in G$ that is closest to $R_{ij}$. As an example, Fig. 2.1 lists a small dissimilarity matrix and its corresponding image. The 0 values on the main diagonal of $R$ generate main diagonal pixels that are black. Notice also that the largest dissimilarity value (0.78) gives two white pixels in the dissimilarity image.
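The gray-level convention above can be sketched as a simple quantization. The function name and the small matrix are illustrative only (the matrix is not the one from Fig. 2.1, although it uses the same maximum of 0.78); pixel values are returned as level indices 0 to 255 rather than as dissimilarities.

```python
def to_gray_levels(R, levels=256):
    """Map each dissimilarity R_ij to the nearest of `levels` equally spaced
    gray levels: index 0 is black (R_ij = 0), levels - 1 is white (R_max)."""
    r_max = max(max(row) for row in R)
    if r_max == 0:
        # Degenerate case: all objects identical, so the image is all black.
        return [[0 for _ in row] for row in R]
    return [[round(v / r_max * (levels - 1)) for v in row] for row in R]

# Hypothetical 3 x 3 dissimilarity matrix (symmetric, zero diagonal).
R = [
    [0.00, 0.20, 0.78],
    [0.20, 0.00, 0.60],
    [0.78, 0.60, 0.00],
]
image = to_gray_levels(R)
```

The diagonal maps to black and the two entries holding the maximum map to white, which mirrors the description of Fig. 2.1.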

## Fig.2.1. A dissimilarity matrix and its corresponding dissimilarity image.

Does the image in Fig. 2.1 indicate that clusters are likely for the five objects underlying the relational data shown there? More generally, can a dissimilarity image show the presence of clusters? We surmise that the usefulness of a dissimilarity image for visually assessing cluster tendency depends crucially on the ordering of the rows and columns of $R$. We will attempt to reorder the objects $\{o_1, o_2, \ldots, o_n\}$ as $\{o_{k_1}, o_{k_2}, \ldots, o_{k_n}\}$ so that, to whatever degree possible, if $k_i$ is near $k_j$, then $o_{k_i}$ is similar to $o_{k_j}$. In this case, the corresponding ordered dissimilarity image (ODI) will frequently indicate cluster tendency in the data by dark blocks of pixels along the main diagonal. A procedure for ordering is given immediately after our second example.

## Fig.2.2 Scatterplot of Data Set A.

Fig. 2.2 is a scatterplot of n = 20 points in a 2-dimensional data set, called Data Set A, that we use to illustrate the importance of properly ordering the rows and columns of the dissimilarity matrix for visual assessment of tendency. This data set has either three visually apparent clusters and one outlying point, or four clusters if singleton clusters are allowed. Fig. 2.3 shows the 19 sequential distances $\{d_{12}, d_{23}, \ldots, d_{19,20}\}$ between points $\{x_1, x_2, \ldots, x_{20}\}$, where indices $\{1, \ldots, 20\}$ correspond to a random initial ordering of the points. The path of line segments indicates the ordering of the data in the scatterplot, with the first point in the ordering represented by the heavy square.

## Fig.2.3 . Scatterplot and dissimilarity image for Data Set A (original random ordering).

The corresponding dissimilarity image in the right view of Fig. 2.3 contains no useful (visual) information about (apparent) structure in Data Set A. Now, we reorder the points so that nearby points are (generally) indexed similarly.

## Fig. 2.4. Scatterplot and ODI for Data Set A (reordered)

Fig. 2.4 gives the scatterplot for the reordered points along with the corresponding ordered dissimilarity image. The ODI indicates the likelihood of clusters, as seen by the one large and two smaller blocks of dark pixels along the main diagonal of the ODI. The isolated outlier is seen as the single black diagonal pixel in the last row and column. The line segments shown in the left view of Fig. 2.4 indicate the sequence of (reordered) indices $\{k_1, \ldots, k_{20}\}$, with $x_{k_1}$ again indicated by the heavy dark square.

The dark diagonal blocks in the ODI in the right view of Fig. 2.4 clearly indicate the presence of one large and two smaller clusters, as well as the isolated singleton in Data Set A. The mechanism underlying the emergence of visually clear blocks on the diagonal of the display is simple. If the algorithm orders the points so that nearby points (generally) have similar index values, then rows of $R$ with similar index values will be similar. This will give repetition in the pixel patterns of nearby groups of rows of $R$, which will in turn give rise to a visible block structure in the ODI. A black block corresponds to a set of nearby points, consecutively ordered. Without the proper ordering, it is essentially impossible to visually assess clustering tendency using a dissimilarity image. The VAT ordering algorithm is similar to Prim's algorithm for finding a minimal spanning tree (MST) of a weighted graph.

## 3. Relatives of VAT and similarity measures

## 3.1 Literature of VAT

Following Huband et al. [17], visual display methods can be roughly grouped into three categories: visual displays of clusters, visual displays to find clusters, and visual displays to assess tendency. The earliest published reference we can find that discusses visual displays of clusters is the SHADE approach of Ling [15]. SHADE approximates what is now regarded as a nice digital image representation of clusters using a crude 15-level halftone scheme created by overstriking standard printed characters. Johnson and Wichern [18] presented the "graphical method of shading", which is related to SHADE. Bezdek and Hathaway [16] produced such displays from distance data.

## 3.2 Similarity measures

The distance measures used here follow Hirano et al. [19]. Let $U = \{x_1, x_2, \ldots, x_N\}$ be the set of objects, where $N$ denotes the total number of objects. Assume that each object has $p = p_c + p_d$ attributes, where $p_c$ is the number of numerical attributes and $p_d$ is the number of nominal attributes. We then denote an object by $x_i = \{x_{i1}, x_{i2}, \ldots, x_{ip}\}$, where $x_{ij}$ denotes the $j$th attribute value of object $x_i$.

3.2.1 Similarity for categorical attributes

In order to measure similarity for categorical attributes, we adopt the Hamming distance that counts the number of attributes for which two objects have different attribute values.

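$$d_H(x_i, x_j) = \sum_{k=1}^{p_d} \delta(x_{ik}, x_{jk}), \tag{3c}$$

where $\delta(x_{ik}, x_{jk}) = 1$ if $x_{ik} \ne x_{jk}$ and $0$ otherwise, with $k$ running over the nominal attributes.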
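As a small illustration of the Hamming distance described above, the sketch below counts the attributes on which two objects differ. The function name and the example records are hypothetical.

```python
def hamming_distance(xi, xj):
    """Number of nominal attributes on which two objects disagree."""
    if len(xi) != len(xj):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for a, b in zip(xi, xj) if a != b)

# Two hypothetical records with three nominal attributes each.
patient_a = ("male", "yes", "married")
patient_b = ("female", "yes", "single")
d = hamming_distance(patient_a, patient_b)
```

Here the two records differ in the first and third attributes, so the distance is 2; identical records have distance 0.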

3.2.2 Similarity for numerical attributes

In order to measure similarity for numerical attributes, we adopt the Mahalanobis distance:

$$d_M(x_i, x_j) = \left\{ (x_i - x_j)^T \Sigma^{-1} (x_i - x_j) \right\}^{1/2} \tag{3a}$$

where $\Sigma$ denotes the variance-covariance matrix. If all of the attributes are independent and all of the attribute values are standardized, the Mahalanobis distance exactly matches the Euclidean distance given below.

$$d_E(x_i, x_j) = \left\{ (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip_c} - x_{jp_c})^2 \right\}^{1/2} \tag{3b}$$
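The relationship between the two distances can be checked numerically. The sketch below is a plain-Python illustration under stated assumptions: the covariance matrix is passed in rather than estimated from data, and the helper `solve` (a small Gaussian elimination routine) is introduced here only to avoid forming the inverse explicitly.

```python
import math

def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[r]) + [b[r]] for r in range(n)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (M[r][n] - tail) / M[r][r]
    return y

def mahalanobis(xi, xj, cov):
    """Equation (3a): sqrt((xi - xj)^T cov^{-1} (xi - xj))."""
    diff = [a - b for a, b in zip(xi, xj)]
    y = solve(cov, diff)  # y = cov^{-1} diff
    return math.sqrt(sum(d * v for d, v in zip(diff, y)))

def euclidean(xi, xj):
    """Equation (3b)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

identity = [[1.0, 0.0], [0.0, 1.0]]
d_m = mahalanobis([1.0, 2.0], [4.0, 6.0], identity)
d_e = euclidean([1.0, 2.0], [4.0, 6.0])
```

With an identity covariance matrix (independent, standardized attributes) both functions return the same value, which is the special case stated above; any other covariance matrix rescales and rotates the difference vector before the length is taken.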

3.2.3 Similarity for mixture attributes

If objects have both numerical and categorical attributes, their dissimilarity is calculated as a weighted sum of the Mahalanobis distance $d_M(x_i, x_j)$ of the numerical attributes and the Hamming distance $d_H(x_i, x_j)$ of the nominal attributes as follows:

$$d(x_i, x_j) = \frac{p_c}{p}\, d_M(x_i, x_j) + \frac{p_d}{p}\, d_H(x_i, x_j) \tag{3d}$$
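A minimal sketch of the weighted sum in (3d) is given below, assuming weights proportional to the number of attributes of each type ($p_c/p$ and $p_d/p$); other weightings could be substituted. For brevity the numerical part uses the Euclidean distance as a stand-in for the Mahalanobis distance, which is only equivalent for independent, standardized attributes. All names are illustrative.

```python
import math

def hamming(ci, cj):
    """Count of nominal attributes with different values."""
    return sum(1 for a, b in zip(ci, cj) if a != b)

def euclidean(ni, nj):
    """Stand-in numerical distance (assumes standardized, independent attributes)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ni, nj)))

def mixed_distance(num_i, cat_i, num_j, cat_j, numeric_dist=euclidean):
    """Weighted sum of a numerical and a nominal distance.

    Assumed weights: p_c / p for the numerical part and p_d / p for the
    nominal part, where p = p_c + p_d.
    """
    p_c, p_d = len(num_i), len(cat_i)
    p = p_c + p_d
    d_num = numeric_dist(num_i, num_j) if p_c else 0.0
    d_cat = hamming(cat_i, cat_j) if p_d else 0
    return (p_c / p) * d_num + (p_d / p) * d_cat

# Two objects with two numerical and two nominal attributes each.
d = mixed_distance([0.0, 0.0], ("a", "b"), [3.0, 4.0], ("a", "c"))
```

In this example the numerical distance is 5 and the Hamming distance is 1, so equal weights of 1/2 give a combined distance of 3.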

## 4. Enhanced VAT algorithm

## Enhanced VAT and Display Algorithm

## Input

Load the multidimensional data set and convert it into a dissimilarity matrix using the Euclidean, Hamming and Mahalanobis distances for numerical, categorical and mixed attributes, respectively. The input is then an $n \times n$ dissimilarity matrix

$$D = [d_{ij}], \quad 0 \le d_{ij} \le 1, \quad d_{ij} = d_{ji}, \quad d_{ii} = 0, \quad 1 \le i, j \le n.$$

## Process

Step (1): Transform $D$ into a new dissimilarity matrix $R$ with $R_{ij} = 1 - \exp(-d_{ij}/\sigma)$, where $\sigma$ is a scale parameter determined automatically from $D$ using the algorithm of Otsu.

Otsu's algorithm [20], which maximizes the between-class variance, has been widely used in image processing for automatically choosing a global threshold. The method relies upon the assumption that all image pixels belong to one of two classes, i.e., background or foreground.
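Step (1) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the scale parameter is taken to be the Otsu threshold of the off-diagonal dissimilarity values, the threshold is computed from a 256-bin histogram, and the function names are hypothetical.

```python
import math

def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of a histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    total = len(values)
    sum_all = sum(h * c for h, c in zip(hist, centers))
    best_t, best_var = lo, -1.0
    w0, sum0 = 0, 0.0
    for k in range(bins - 1):
        w0 += hist[k]
        sum0 += hist[k] * centers[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, lo + (k + 1) * width
    return best_t

def enhance_dissimilarity(D):
    """Return (R, sigma) with R_ij = 1 - exp(-d_ij / sigma)."""
    n = len(D)
    off_diagonal = [D[i][j] for i in range(n) for j in range(i + 1, n)]
    sigma = otsu_threshold(off_diagonal)  # assumed choice of the scale
    if sigma <= 0:
        raise ValueError("all dissimilarities are zero; nothing to rescale")
    R = [[1.0 - math.exp(-D[i][j] / sigma) for j in range(n)] for i in range(n)]
    return R, sigma

# Two tight pairs of objects, far apart from each other.
D = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [1.0, 0.9, 0.1, 0.0],
]
R, sigma = enhance_dissimilarity(D)
```

The exponential transform keeps the diagonal at zero, preserves symmetry, and pushes dissimilarities well above the scale toward 1, which tends to increase the contrast between within-cluster and between-cluster pixels.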

Step (2): Form an RDI image $I^{(1)}$ corresponding to $R$ using the VAT algorithm.

Let $I$ and $J$ be subsets of $K = \{1, \ldots, n\}$. We let $\arg\min_{p \in I, q \in J} \{R_{pq}\}$ denote the set of all ordered index pairs $(i, j)$ in $I \times J$ such that $R_{ij} = \min_{p \in I, q \in J} \{R_{pq}\}$. This differs from the usual meaning of "arg min" only in that a call to $\arg\min(f(\cdot))$ ordinarily returns only one argument that minimizes $f$, whereas here we collect all arguments that yield the (same) minimizing value. The "arg max" notation is defined similarly. The algorithm for producing an ordered dissimilarity matrix $\tilde{R} = [\tilde{R}_{ij}]$ from the original dissimilarity matrix $R$ is now given. The permuted indices of the $n$ objects are stored in an array $P$, with $P(i) = k_i$, $i = 1, \ldots, n$.

Step (2.1): Set $I = \emptyset$, $J = \{1, 2, \ldots, n\}$ and $P = (0, 0, \ldots, 0)$.

Select $(i, j) \in \arg\max_{p \in J, q \in J} \{R_{pq}\}$.

Set $P(1) = i$, $I \leftarrow \{i\}$ and $J \leftarrow J - \{i\}$.

Step (2.2): Repeat for $t = 2, \ldots, n$:

Select $(i, j) \in \arg\min_{p \in I, q \in J} \{R_{pq}\}$.

Set $P(t) = j$, and update $I \leftarrow I \cup \{j\}$ and $J \leftarrow J - \{j\}$.

Step (2.3): Form the ordered dissimilarity matrix $\tilde{R} = [\tilde{R}_{ij}] = [R_{P(i)P(j)}]$, for $1 \le i, j \le n$.

Step (3): Display the reordered matrix as the ODI using the conventions given above.
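Steps (2.1) to (2.3) can be sketched as below. The sketch uses 0-based indices, breaks ties by taking the lowest index (the algorithm allows any minimizer), and keeps a running "closest selected object" distance for each unselected object, in the style of Prim's algorithm. The function names and the one-dimensional test data are illustrative.

```python
def vat_order(R):
    """Return the VAT permutation P (0-based) for a dissimilarity matrix R."""
    n = len(R)
    # Step (2.1): start from a row that contains the largest dissimilarity.
    first = max(range(n), key=lambda p: max(R[p]))
    P = [first]
    J = set(range(n)) - {first}
    # best[q] = smallest dissimilarity between q and any selected object.
    best = {q: R[first][q] for q in J}
    # Step (2.2): repeatedly add the unselected object closest to the selected set.
    while J:
        j = min(J, key=lambda q: (best[q], q))
        P.append(j)
        J.remove(j)
        for q in J:
            if R[j][q] < best[q]:
                best[q] = R[j][q]
    return P

def reorder(R, P):
    """Step (2.3): ordered dissimilarity matrix with entries R[P(i)][P(j)]."""
    return [[R[a][b] for b in P] for a in P]

# Five points on a line forming two groups: {0, 0.5, 1} and {10, 11}.
points = [0.0, 10.0, 1.0, 11.0, 0.5]
R = [[abs(a - b) for b in points] for a in points]
P = vat_order(R)
R_tilde = reorder(R, P)
```

For these points the ordering visits the three nearby points first and the two distant ones last, so the reordered matrix has a 3 x 3 and a 2 x 2 block of small values along its diagonal.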

## Output

A scaled gray-scale image $I(D)$, in which $\max(d_{ij})$ corresponds to white and $\min(d_{ij})$ corresponds to black.

## 5. Experimental Results

The method was evaluated on several UCI data sets and one real-world data set. The first five data sets are Dermatology, Heart, Hepatitis, Iris and Wine from the UCI Machine Learning Repository. The real-world data set is an HIV data set with drug regimen as the class attribute. For each example, the VAT image and the Enhanced VAT image with the class attribute were produced. The characteristics of the data sets and the results of VAT and Enhanced VAT are summarized in Table 5.1.

## Table. 5.1 Summary of UCI and Real datasets' characteristics and the results using VAT and Enhanced VAT

| Data set | Dermatology | Heart | Hepatitis | Iris | Wine | HIV |
|---|---|---|---|---|---|---|
| # instances | 357 | 270 | 72 | 150 | 178 | 400 |
| # clusters | 6 | 2 | 2 | 3 | 3 | 6 |
| # in each cluster | [110,59,70,48,51,19] | [150,120] | [12,60] | [50,50,50] | [59,71,48] | [221,144,11,17,5,2] |
| Attribute type | Integer | Integer/Real | Integer | Integer/Real | Integer/Real | Integer/Real |
| # attributes | 34 | 13 | 20 | 5 | 13 | 19 |
| VAT | 5 | 2 | 2 | 2 | 3 | 4 |
| Enhanced VAT | 6 | 2 | 2 | 3 | 3 | 5 |

## 5.1 Results for UCI dataset of Dermatology

The aim of this database is to determine the type of erythemato-squamous disease. These diseases all share the clinical features of erythema and scaling, with very few differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. The data set contains 357 instances with 34 attributes, including the class attribute: 110 instances for class 1, 59 for class 2, 70 for class 3, 48 for class 4, 51 for class 5 and 19 for class 6. Starting with 34-dimensional feature vectors, we computed pairwise dissimilarities using the Euclidean, Hamming and Mahalanobis distances to obtain relational data for VAT and Enhanced VAT.

## Fig. 5.1 (a) Histogram of the dissimilarity matrix. (b) Unordered image of the Dermatology data set. (c) VAT image of R. (d) Enhanced VAT image of R.

Fig. 5.1(a) shows the histogram of the dissimilarity matrix, which displays the dissimilarity values of the data set graphically. The results of VAT and Enhanced VAT are compared. Fig. 5.1(c) depicts the VAT image for all attributes; the clusters overlap, so the number of dark blocks in the image is not clear. In Fig. 5.1(d), by contrast, the Enhanced VAT image for the class attribute shows the dark blocks clearly on the diagonal. The image depicts 6 dark blocks, which matches the number of classes in the data set.

## 5.2 Results for UCI dataset of Heart

This data set concerns the prediction of heart attack. It has 13 attributes, including age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol, fasting blood sugar, etc. The total number of instances in this data set is n = 270, of which 150 denote the absence and 120 the presence of heart attack. Starting with 13-dimensional feature vectors, we computed pairwise dissimilarities to obtain relational data for VAT and Enhanced VAT.

## Fig. 5.2 (a) Histogram of the dissimilarity matrix. (b) Unordered image of the Heart data set. (c) VAT image of R. (d) Enhanced VAT image of R.

The results of VAT and Enhanced VAT are presented in Fig. 5.2. The histogram of the dissimilarity matrix is shown in Fig. 5.2(a). The VAT image in Fig. 5.2(c) can be compared with the Enhanced VAT image in Fig. 5.2(d). The image shows 2 dark blocks, which matches the number of classes in the data set.

## 5.3 Results for UCI dataset of Hepatitis

Hepatitis is an inflammation of the liver characterized by the presence of inflammatory cells in the tissue of the organ. This data set contains patient details from which we analyze whether the patient survived. It has 20 attributes (including the class), among them age, sex, steroid, antiviral, fatigue, malaise, etc. The total number of instances in this data set is n = 72, of which 12 patients died and 60 are alive. Starting with 20-dimensional feature vectors, we computed pairwise distances to obtain relational data for VAT and Enhanced VAT.

## Fig. 5.3 (a) Histogram of the dissimilarity matrix. (b) Unordered image of the Hepatitis data set. (c) VAT image of R. (d) Enhanced VAT image of R.

Fig. 5.3(a) shows the histogram of the dissimilarity matrix, which displays the dissimilarity values of the data set graphically. The results of VAT and Enhanced VAT were analyzed. In Fig. 5.3(d), the Enhanced VAT image for the class attribute shows the dark blocks clearly on the diagonal. The image depicts 2 dark blocks, one larger and one smaller, matching the number of classes in the data set.

## 5.4 Results for UCI dataset of Iris

This is perhaps one of the best-known databases to be found in the pattern recognition literature. The data set contains three physical classes, 50 instances each (n=150), where each class refers to a type of iris plant. The attributes of each instance include four numeric values, corresponding to sepal length, sepal width, petal length and petal width, respectively. It is generally accepted [21] that in this data set, one class is linearly separable from the other two classes while the latter two are not linearly separable from each other.

## Fig. 5.4 (a) Histogram of the dissimilarity matrix. (b) Unordered image of the Iris data set. (c) VAT image of R. (d) Enhanced VAT image of R.

Fig. 5.4(a) shows the histogram of the dissimilarity matrix, which displays the dissimilarity values of the data set graphically. The results of VAT and Enhanced VAT are compared. Fig. 5.4(c) depicts the VAT image for all attributes and shows that the clusters overlap. In Fig. 5.4(d), the Enhanced VAT image for the class attribute shows the dark blocks clearly on the diagonal. The image depicts 3 dark blocks of differing size, which matches the number of classes in the data set.

## 5.5 Results for UCI dataset of Wine

This data set contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determines the quantities of 13 constituents found in each of the three types of wine. The attributes include alcohol, malic acid, ash, magnesium, etc. The total number of instances in this data set is n = 178: 59 for class 1, 71 for class 2 and 48 for class 3. Starting with these 13-dimensional feature vectors, we computed pairwise dissimilarities using the Euclidean distance to obtain relational data for VAT and Enhanced VAT.

## Fig. 5.5 (a) Histogram of the dissimilarity matrix. (b) Unordered image of the Wine data set. (c) VAT image of R. (d) Enhanced VAT image of R.

Fig. 5.5(a) shows the histogram of the dissimilarity matrix, which displays the dissimilarity values of the data set graphically. In Fig. 5.5(d), the Enhanced VAT image for the class attribute shows the dark blocks clearly on the diagonal. The image depicts 3 dark blocks, one larger and two smaller.

## 5.6 Results for HIV real world data set

The proposed method was tested on a set of samples collected from various Integrated Counseling and Testing Centers (ICTC) and Antiretroviral Therapy (ART) centers in Tamil Nadu and Pondicherry. After preprocessing, the Enhanced VAT algorithm was applied to the HIV/AIDS diagnosis data set containing 400 objects.

## Table. 5.2 Structure of the HIV (preprocessed) dataset

| Obj # | CA | Age | Sex | WT | HB | Treat-Drug (regimen) | … | WBC | CD4 Count | TLC | SGPT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 25 | 1 | 60 | 14 | 1 | … | 4600 | 500 | 4.0 | 46.0 |
| 2 | 2 | 35 | 1 | 48 | 11 | 2 | … | 6400 | 100 | 5.0 | 47.0 |
| ⋮ | 1 | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| ⋮ | 1 | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 400 | 2 | 45 | 0 | 58 | 13.5 | 1 | … | 3500 | 150 | 3.0 | 40.0 |

Table 5.2 shows the structure of the data set after preprocessing, which depends on the nature of each attribute. The attributes are Age, Sex, WT, HB, Treat Drug, Pill count, Initial drug, Occupation, Marital status, CD4, CD8, Ratio, WBC, RBC, PCV, platelet, TLC, SGPT, SGOP and Drug regimen, the last being the Class Attribute (CA). The total number of instances in this data set is n = 400: 221 for class 1, 144 for class 2, 11 for class 3, 17 for class 4, 5 for class 5 and 2 for class 6. Starting with 19-dimensional feature vectors, we computed pairwise dissimilarities using the Euclidean, Hamming and Mahalanobis distances to obtain relational data for Enhanced VAT.

## Fig. 5.6 (a) Histogram of the dissimilarity matrix. (b) Unordered image of the HIV data set. (c) VAT image of R. (d) Enhanced VAT image of R.

Fig. 5.6(a) shows the histogram of the dissimilarity matrix, which displays the dissimilarity values of the data set graphically. The results of VAT and Enhanced VAT are compared. In Fig. 5.6(d), the Enhanced VAT image for the class attribute shows the dark blocks clearly on the diagonal. The image depicts 5 dark blocks, one fewer than the 6 drug regimen classes listed in Table 5.1 but closer than the 4 blocks found by VAT.

In the present study, the quality of the clusters is verified by the dark blocks obtained on the diagonal of the Enhanced VAT image. These blocks confirm the association of objects with their clusters in the reordered matrix.

## 6. Conclusions

This paper investigates a nearly parameter-free method for automatically estimating the number of clusters in unlabeled data sets. The improved version of the VAT algorithm operates on an n x n dissimilarity matrix of unlabeled data objects and also evaluates the quality of the clusters found. In addition to automatically determining the number of clusters in a raw data set, Enhanced VAT reorders the matrix so that related objects form blocks along the diagonal, which makes it possible to validate the relationship between objects and clusters. Once the association between objects and their cluster has been verified, the clustering of the unlabeled data becomes more specific. Enhanced VAT also helps to identify which properties of the objects' dissimilarities relate to a cluster. The concentration of unlabeled data in each cluster can be checked from the dark blocks along the diagonal while the number of clusters is being estimated. Viewed sequentially, the improved VAT integrates this information into a set of cluster profile graphs.