# Cluster analysis

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

### CHAPTER TWO: LITERATURE REVIEW

We employ cluster analysis to address our research question as it is a technique that allows users to identify natural underlying group structure of complex datasets. In doing so, we can identify the type of firms and the level of firm quality that a particular group of directors is associated with. Specifically, we expect that directors sitting on high and low quality firms will be grouped together respectively. Figure 4.1 shows a flowchart illustrating the methodology employed in this study. After cluster analysis, we perform regression analysis. Using the relation found in regression analysis, we compute the predicted number of directorships for all directors included in our analysis. Finally, we do a comparison of the actual and predicted number of directorships for directors in each group obtained from cluster analysis earlier. Such a comparison allows us to directly address our hypotheses developed in Chapter Three.

### 1.1.1. Cluster Analysis

Cluster Analysis is an unsupervised learning technique, which allows users to explore complex datasets, through the identification of natural group structures underlying the data (Everitt, 1993; Jain et al., 1999; Duda et al., 2001; Hastie et al., 2001). In essence, cluster analysis partitions objects into various groups for which similarity between objects in the same group, and dissimilarity between groups are maximized (Hand et al., 2001). These objects are represented by multi-dimensional variables, and the similarity or dissimilarity between two objects are measured as the distance between the multi-dimensional variables vectors that represent the objects.

There are many ways of performing cluster analysis, and it is highly improbable that two different algorithms chosen will lead to completely identical partitions of data. Further, each algorithm is associated with potential shortcomings; no one particular algorithm can be considered the best. Hence, rather than choosing a particular algorithm to perform our cluster analysis and following Shen et al. (2007), we perform cluster analysis by employing three conceptually different algorithms - Hierarchical Clustering, K Means clustering and Expected Maximization for Gaussian Mixture Model, and subsequently employ consensus clustering to obtain a single consensus solution. This methodology allows us to avoid the potential shortcomings of various clustering methods, improving the reliability of our clusters solutions. Figure 4.2 shows a flowchart illustrating the steps we take to perform cluster analysis.

There are two main methods of clustering directors in our study; either by individual director characteristics or by individual firm characteristics. Firm characteristics can differentiate between directors as they define the type of firm a director is associated with. For instance, two directors will appear to be different when one sits on a firm with large market capitalization while the other sits on a small market capitalization firm. In our study, we attempt to cluster directors by the latter method as it is reasonable to expect that individual director characteristics (e.g. previous experience) have been considered by companies themselves when they invite those directors to join their boards. In this way, our cluster analysis also takes in consideration individual director characteristics. In defining clustering variables, we follow a deductive approach in which the suitability of variables is theory-based (Ketchen et al., 1993). This is often considered a better approach to guide variable choices as irrelevant variables may affect the validity of a clusters solution (Punj and Stewart, 1983). We employ the seven firm characteristics previously discussed in Chapter Three as our clustering variables.

As cluster analysis groups objects by maximizing distance between groups along all clustering variables, variables with larger ranges tend to dominate (i.e. given more weight) the clusters solution in comparison to those variables with smaller ranges (Hair et al., 1992). Considering that the range of firm size (measured by market capitalization) is several times larger than the range of ownership structure variables, firm size variables are likely to dominate our clusters solution. Although some studies claim that standardization of variables to equally scaled data is not necessary (Dolnicar, 2002) and may distort results, we choose to standardize the variables by Z-scores to avoid clusters solutions that are dominated by market capitalization.

### 1.1.1.1. Cluster Analysis Algorithms, Cluster Validation and Consensus Clustering

In the following sections, we describe the three algorithms employed to perform cluster analysis. Subsequently, we discuss the cluster validations steps taken and illustrate the use of consensus clustering to find a single consensus solution. We perform the following steps on three sets of data; all directors, outside directors and independent directors as extant research have found that dissimilar behaviours between inside and outside directors can lead to contrasting results in studies (Mak et al., 2003 and Fich and Shivdasani, 2006).

### 1.1.1.1.1. Cluster Analysis Algorithms

Hierarchical clustering

Hierarchical clustering can be either agglomerative or divisive. Divisive methods begin with every object in one cluster and end when each object is in individual clusters. Divisive methods are not readily available and are thus rarely cited. In this thesis, we employ the more commonly used agglomerative method and discuss the basic process below.

The basic process of an agglomerative hierarchical clustering for N objects is (Johnson, 1967):

1. Assign each object to its own cluster, thus the method begins with N clusters.
2. The closest (most similar) pair of clusters are merged into a single cluster, thus there are now (N-1) clusters.
3. Recompute distances (similarities) between the new cluster and the old clusters individually.
4. Iterate through steps 2 and 3 until there is only a single cluster left.

In step 2, several methods to compute distances exist, and they are referred to as linkage function. For instance, single linkage clustering computes similarity of two clusters as the similarity of their most similar members. Complete linkage clustering measures the similarity of two clusters as the similarity of their most dissimilar members. In the analysis, we choose Ward's method (Ward, 1963), a method that is distinct from the aforementioned methods, as our linkage function. Ward's method (Ward, 1963) chooses each successive merging step through the criterion of minimizing the increase in the error sum of squares (ESS) at each step. The ESS of a set X of NX values is given by the functional relation,

where: |.| is the absolute value of a scalar value or the norm (the "length") of a vector.

The linkage function referring to the distance between clusters X and Y is given by,

where: XY is the combined cluster resulting from merged clusters X and Y;

ESS(.) is the error sum of squares described above.

In addition to a linkage function, a metric for measuring distance between two objects is required. In our study, the Squared Euclidean Distance (SED) is chosen as our metric for distance measure for both Hierarchical Clustering and K Means clustering. If two objects, x1 and x2 in the Euclidean n-space is given by x1 = (x1i, x1i, …, x1n) and x2 = (x2i, x2i, …, x2n), then the SED between these two objects is given by,

While Agglomerative Hierarchical Clustering (AHC) does not require user to specify the number of clusters, k, a priori, a drawback of AHC is that it neglects the phenomenon of input order instability. In steps 2 and 3 of an AHC, a problem arises when two pairs of clusters are both calculated to have the smallest distance value. “In such cases arbitrary [italics added] decisions must be made” (Sneath & Sokal, 1973) to choose the pair of clusters that will be merged. These arbitrary decisions extend to computer programs (Spaans & Van der Kloot, 2005) and as a result, different input orders of objects in the proximity matrix can result in significantly different clusters solutions (Van der Kloot et al., 2005). To avoid this pitfall, we employ PermuCLUSTER for SPSS (Van der Kloot et al., 2005). This program repeats AHC for a user specified number of times by permuting the rows and columns of the proximity matrix. Thereafter, it evaluates the quality of each AHC solution by using a goodness-of-fit measure (SSDIFN) given by,

where: dij is the distances of the objects in the original proximity matrix

cij is the distances of the objects in the AHC tree

In our analysis, we first employ PermuCLUSTER where the number of AHC repetitions is set at 500, and evaluate the resultant optimal solutions obtained. Thereafter, we validate the clusters solutions (k = 2 to 35) of the optimal solution.

K Means clustering

In K Means clustering, K refers to the number of clusters, though unknown a priori, has to be specified by the user. There is a centroid in each cluster, usually computed as the mean of the variable vectors in that cluster. Clustering is decided based on the association to the nearest centroid.

The basic process of the K Means clustering (MacQueen, 1967) is:

1. Determine initial centroids.
2. Find the closest centroid to each object and assign the object to the cluster associated with this centroid.
3. Recalculate the centroid for each of new clusters based on new cluster memberships.
4. Iterate through steps 2 and 3 until convergence.

The algorithm converges when cluster membership of data points remain unchanged. In this situation, other widely-used conditions such as centroid computed and sum of squared distances from data points to their centroids stay constant.

K Means clustering iteratively shifts objects to various clusters, seeking to minimize the sum of squared distances, denoted by J, between each object and its cluster centroid. The sum of squared distances, Ji, for the ith cluster, denoted as Ci is given by,

where: is the Squared Euclidean Distance from object x in Ci to its centroid yi

The sum of squared distances of all the k clusters is given by,

In step 1, different sets of initial centroids can ultimately result in different local minima of J. However, we would like to find the clusters solutions that can result in the global minimum. The best methodology involves utilizing all sets of initial centroids in the analysis, but it is expensive and thus not viable. As an alternative, we repeat K Means clustering (for k = 2 to 35) 500 times with 500 random sets of initial centroids, to find clusters solutions that are either global minima or at least local minima that is the closest to the global minima among the various local minima.

Expectation Maximization for Gaussian Mixture Model

In the Gaussian Mixture Model (GMM), Expectation maximization (EM) algorithm seeks to find the maximum likelihood estimates for mixture models when the model is dependent on unknown latent variables.

The main steps of the EM method are (Dempster et al., 1977):

1. Compute the parameters (mean and variance) for the k Gaussian distributions.
2. Using the probability density function of Gaussian distribution, calculate the probability density for each feature vector in each of the k clusters
3. With the probability densities calculated in step 2, re-compute the parameters for each of the k Gaussian distributions
4. Repeat Steps 2 and 3 until convergence.

We perform EM clustering using MIXMOD for Matlab (Biernacki et al., 2006) and the statistical documentation are as follows. Clustering using mixture models typically partitions x objects into K clusters denoted by labels , with and depending on whether xi is assigned to kth cluster or not. In a mixture model where n independent vectors of a dataset are represented by x = {x1,…,xn}, each xi arises from a probability distribution with density,

where: pk is the mixing proportions (0 < pk < 1 for all k = 1, …,K and p1 +…+pK = 1)

h(.|λk) is the d-dimensional distribution parameterized by λk.

As such, we can show how each xi arises from a probability distribution with density in a GMM by replacing λk with its associated d-dimensional Gaussian density with mean µk and variance matrix Σk ,

where:

ϴ = (p1…,pK, μ1,…,μK, ∑1,…, ∑K) is the vector of the mixture parameters

Clusters can be derived from the maximum likelihood estimates of the mixture parameters ϴ obtained by using the Expectation Maximization (EM) algorithm. The maximum likelihood estimation of the GMM is given by,

Each xiis assigned to the cluster that provides the largest conditional probability that xi arises from it using the Maximum A posteriori (MAP) principle.

The MAP principle is given by,

where:

EM algorithm seeks to find maximum likelihood estimates through iteration of the expectation and maximization steps.

The mth iteration of the Expectation step is given by,

and the maximum likelihood estimate of the mth iteration (denoted by ϴm) is updated using the conditional probabilities as conditional mixing weights. This leads to the Maximization step as given by,

where:

As in K Means clustering, different initial values of the parameters may lead to different local maxima of the maximum likelihood estimate function. As such, to ensure that we can get a maximum that is either the global maximum or a local maximum that is closest to the global maximum, for each k (where k = 2 to 35), we repeat the algorithm 500 times using 500 random sets of initial parameters. The optimal solutions are the clusters solutions which result in the highest maximum likelihood estimate.

### 1.1.1.1.2. Cluster Validation

As cluster analysis is an unsupervised technique, cluster validation is a necessary step to evaluate results of cluster analysis in an objective and quantitative manner.

The main validation objectives are:

1. Determination of clustering tendency
2. Determination of the number of clusters
3. Evaluate how well a clusters result represent the natural group structure underlying the data based on information intrinsic to the data alone (i.e. internal validation) (Handl, Knowles, & Kell, 2005);
4. Evaluate clusters results based on comparison with known class labels which correspond to the natural group structure underlying the data (i.e. external validation) (Handl, Knowles, & Kell, 2005)

As clustering techniques are known to find clusters even when there is no underlying cluster structure, objective 1 is fundamental for cluster analysis. Objective 2 is imperative because the number of clusters is an essential parameter in two clustering techniques that we employ. To the best of our knowledge, this is the first time a cluster analysis is performed on the market for directors. Hence, there are no established class labels that correspond to the natural cluster structure. Thus we will only carry out an evaluation based on internal validation measures.

### Assessment of Clustering Tendency

In comparison to other validation steps, assessing clustering tendency is a step prior to actual clustering of the data. In our study, we utilize self-organizing maps (SOMs) to assess the clustering tendency of our data. SOM Toolbox for Matlab (Vesanto et al., 1999) is employed to perform SOM training and visualization. An SOM consists of neurons as components that are organized on a regular low dimensional grid. Each neuron is represented by a weight vector of the same dimensions as the input vectors. Connections between adjacent neurons are by a neighbourhood relation, which dictates the topology, or structure, of the map. The SOM training algorithm moves the weight vectors around so that the map is organized in a way whereby neurons of similar weight vectors are grouped together.Visualization of SOM is performed through the U-matrix. By visualizing distances between neighbouring map units, U-matrix allows the creation of a clustering tendency map. In a clustering tendency map, high values (represented by dark-coloured hexagons) of the U-matrix indicate possible clusters borders while uniform areas of low values (represented by light-coloured hexagons) show possible clusters. Figure 4.3 illustrates a high clustering tendency map and a low clustering tendency map.

### Internal Validation

We mainly make use of internal validation indices to evaluate the fitness of a clusters solution. Fitness measures are associated with the geometrical properties of clusters (i.e. compactness, separation and connectedness). These properties are utilized as most clustering methods usually optimize these properties to discover underlying group structure in the data (Johnson, 1967; Dempster et al., 1977; Kaufman and Rousseeuw, 1990; Handl and Knowles, 2006). Utilization of internal validation indices also allows us to find the optimal number of clusters (k), indicated by the clusters solution with the highest quality. For Hierarchical Clustering and K Means clustering, employing the program CVAP (Wang et al., 2009), we validate our clusters solutions with two different indices - Average Silhouette Width and C-Index, to ensure that our clustering results are robust to different validation measures.

### Average Silhouette Width

Average Silhouette Width is a composite index which measures both compactness and separation of clusters (Kaufman and Rousseeuw, 1990). Silhouette width compares the similarity between an object and other objects in the same cluster with the similarity between the same object and other objects in a neighbour cluster. A neighbour cluster N(Xi) to object Xi in cluster C(Xi) is defined as the cluster whose objects have the shortest average distance to object Xi amid all the clusters beside cluster C. The neighbor cluster N(Xi) is given by,

where: Xiis the objects in the dataset

d(Xi,Xj) is the distance between two objects Xi and Xj

The silhouette width for Xi, as denoted by Si, is given by,

where: is the average distance between Xi and the objects in cluster C(Xi)

is the average distance between Xi and the objects in neighbour cluster N(Xi)

Silhouette width, Si, ranges from -1 to 1. When Si is close to 1, the clustering solution give good clusters and that Xi is likely to be assigned to the appropriate cluster. When Si is close to 0, Xican likely be assigned to another cluster and when Si is close to -1, Xi is likely to be assigned to a wrong cluster.

Average Silhouette Width (AS) is given as,

Thus, the best clusters solution associated with the optimal number of clusters (k) is given by the AS with the largest value.

C Index

C Index (Hubert and Levin, 1976) is a cluster similarity measure. The best clusters solution is identified as the solution that results in the lowest value. C Index (C) is given by,

where: S is the sum of pairwise dissimilarities between all pairs of objects in the same cluster

If the cluster has n such dissimilarities, then Smin is the sum of the n smallest pairwise dissimilarities

Similarly, Smaxis the sum of the n largest distance for all the pairs of pattern

In CVAP (Wang et al., 2009) however, the optimal k is given by the value which results in the steepest knee. Steepest knee refers the greatest jump of indices value between 2 k.

Bayesian Information Criterion

When using EM algorithm to maximize the likelihood of the parameters, a phenomenon known as overfitting (Hand et al., 2001) may occur. In this situation, additional parameters are chosen to increase the likelihood obtained thus resulting in an overly complex model which fits the data too closely.

As such, we validate our clusters solutions with Bayesian Information Criterion (BIC) as it is formulated to avoid overfitting.

Typically, one can choose a model that maximizes the integrated likelihood given by,

where: integrated likelihood is

likelihood is

ϴm,K is the parameter space of the model m with K clusters

is a non informative or a weakly informative prior distribution on ϴ for this model

Following that, the asymptotic approximation of the integrated likelihood valid under regularity conditions (Schwarz, 1978) is given as,

where: is the maximum likelihood estimate of ϴ

vm,K is the number of free parameters in model m with K clusters

Finally, this results in the minimization of BIC criterion given by,

where is the maximum log-likelihood for m and K

### 1.1.1.1.3. Consensus Clustering

Consensus clustering, as its name suggests, is employed to find a consensus solution that is in agreement with several clusters solutions obtained through multiple clustering algorithms. As there are associated shortcomings with individual clustering algorithms, for instance in K Means clustering, one has to specify number of clusters a priori, consensus clustering, by combining solutions, can help eliminate such shortcomings. Further, consensus clusters solutions are less sensitive to noise, outliers or sample variations (Nguyen and Caruana, 2008). Shen et al. (2007) also contend that combining clusters results from multiple methods is more likely to expose the underlying natural group structure and trends present in the dataset. As a result, consensus clstering helps to increase the quality and robustness of clusters results (Strehl and Ghosh, 2002).

Intuitively, a superior aggregate solution should share as much information as possible with individual clusters solutions, as clusters which remain relatively similar across multiple algorithms runs are likely to reflect the actual group structure underlying the dataset. Accordingly, in our analysis, we make use of Meta-Clustering Algorithm (MCLA) (Strehl and Ghosh, 2002) that seeks to optimize the average mutual information shared between different pairs of clusters. Before illustrating MCLA, we will first explain the transformation of a cluster solution into hypergraph representation. Figure 4.4 illustrates the transformation of a cluster solution into hypergraph representation. Each cluster solution is transformed into a binary membership indicator matrix Hq with a column for each cluster (denoted as hyperedge hi). For each hi, 1 indicates that the vertex which corresponds to the row belongs to that hyperedge and 0 indicates otherwise.

 Cluster solutions λi H1 H2 H3 Directors λ1 λ2 λ3 h1 h2 h3 h4 h5 h6 x1 1 2 1 v1 1 0 0 1 1 0 x2 1 2 2 v2 1 0 0 1 0 1 x3 2 1 2 v3 0 1 1 0 0 1 x4 2 1 2 v4 0 1 1 0 0 1

MCLA is based on clustering clusters by first constructing a meta-graph. After transformation into hypergraph representation, MCLA first view all hyperedges as vertices of another regular and undirected graph, the meta-graph. Edges are weighted based on the similarity between vertices measured by the Jaccard measure. Next, the meta-graph is partitioned into k-balanced meta-clusters using METIS (Karypis and Kumar, 1998). Thereafter, hyperedges in each k meta-clusters are collapse into single meta-hyperedge. Finally, each object is assigned to the meta-cluster that it is most associated with. The association is computed by averaging all hyperedges of a meta-cluster. We run MCLA for k = 2 to 10, as the optimal k for our clustering solutions did not go beyond 5. Intuitively, it is highly unlikely that the optimal k associated with the consensus solution will go beyond 10.

We also perform two other consensus clustering algorithms described in Strehl and Ghosh (2002). Cluster-based Similarity Partitioning Algorithm (CSPA) finds the consensus solution by first creating a n x n binary similarity matrix in which two objects are denoted as 1 if they are in the same cluster, and 0 if otherwise. An induced similarity measure is then calculated. Subsequently, the algorithm utilises METIS (Karypis and Kumar, 1998) to recluster the objects based on the similarity measure, yielding a consensus clusters solution. HyperGraph Partitioning Algorithm (HGPA) re-partitions the data by viewing original clusters as indications of strong bonds. To find clusters, the algorithm partitions the hypergraph by separating the least number of hyperedges.

### 1.1.2. Cross - Sectional Regressions

To investigate whether our sample supports our hypotheses, consistent with Ahn et al. (forthcoming), we model the number of directorships as a function of directorial characteristics and associated firm characteristics. Accordingly, we use age, education qualifications and prior or current employment as proxies for directors' quality. We also include tenure to measure a director's commitment to a firm. For firm quality variables, we include firm size, firm age, inclusion in the Straits Times Index and governance quality. We also investigate the relation between block ownership and optimal number of directorships.

A total of eighteen independent variables are investigated; half being directorial characteristics and the other half being firm characteristics. Directorial characteristics are: age, tenure, founder of a listed company, prior or current employment as civil servant, CEO or Chairman, CFO or COO, partner of tier 1 accounting or law firms or academic, gender and education qualifications. Firm characteristics are: governance quality, firm age, block ownership structure defined as family firms, GLCs or founder-managed firms, inclusion in Straits Times Index, firm size and main country of business. We control for firms' financial and operation stability measured by its leverage and stock return volatility, industry growth opportunities and profitability measured by industry average market to book ratio and industry average Return on assets (ROA). The dependent variable comprises of the number of directorships held by a director as of Dec 31, 2008.

There are two possible ways of performing the regression, one by considering the total number of directorships and the other by considering the total number of directors. To ensure that our results are robust, we perform regressions based on both ways. Moreover, to ensure that the results are robust to board appointments, both partial and full models are performed on the full sample of all directors, a subsample comprising of outside directors and a subsample consisting of only independent directors.

The full model is given as,

 Log(Number of Directorships held) = α + β1Age + β2Tenure + β3Founder-Director + β4Civil Servant + β5CEO/Chairmain + β6Other key positions + β7Partners of Tier 1 accounting/law firm + β8Academic + β9Female + β10Degree Holder + β11GTI Score + β12Firm Age + β13Family firms + β14Government Linked Firms + β15Founder-Managed Firms + β16STI Dummy+ β17Firm Size + β18Main Country of Business + β19Debt to Total Asset + β20Return Volatility + β21Industry Market Price to Book Value + β22Industry ROA

As our study is concerned with the modeling of the number of directorships, Poisson regression is naturally chosen as it is a regression model employed in modeling count data. Although negative binomial regression can be utilized to model count data, it is typically used when there are signs of overdispersion in Poisson regression. Our data show little signs of overdispersion and thus, Poisson regression is determined to be a suitable regression model for our sample.

To address the hypotheses developed in Chapter Three, we compute the optimal number of directorships using estimated coefficients obtained from the regressions results and compare them with the actual number of directorships held for directors in various clusters.

Outside directors refer to non-independent and independent non-executive directors