Cluster analysis is typically applied to data consisting of many variables collected from a large sample of respondents. The cluster analysis procedures search through the data and identify respondents who have given identical, or at least similar, answers to a certain combination of questions. These respondents are formed into one cluster.
The most important application of cluster analysis is in market segmentation studies. Whenever researchers try to segment a market on the basis of several variables, they prefer to use cluster analysis.
Real Time application
Suppose a sporting goods company wishes to identify the different market segments that constitute the total market for sports-related equipment.
They select a sample of users of all kinds of sporting equipment and collect information about their attitudes, such as their preference for outdoor or indoor sports and the particular sports they prefer.
They can then apply cluster analysis to these data to see whether the total market consists of a number of different segments.
Objective and purpose of Cluster analysis approach in business research
Finding homogeneous clusters
The main objective of cluster analysis is to identify different groups (or clusters) of respondents so that the respondents in any one cluster are similar to each other but different from the respondents in the other clusters.
Such clusters should be internally homogeneous and externally heterogeneous.
Purpose in business research
In business research, companies want to find the segment of people who can be their target customers. To find which segment contains their potential customers, they collect raw data through a survey. Cluster analysis sorts through the raw data and groups the respondents into clusters. Since the clusters are internally homogeneous and externally heterogeneous, companies can easily identify the cluster whose respondents are their potential target customers.
Developing a research plan for Cluster analysis:
Cluster analysis research plan:
Assumptions of Cluster analysis:
Following are some of the assumptions for cluster analysis:
How many clusters
As with other multivariate techniques, cluster analysis raises the question of how many factors, dimensions, or clusters to keep.
The dendrogram gives us the clustering structure at every stage as well as the distances among the clusters, as shown below.
One rule of thumb for this is to choose a place where the cluster structure remains stable for a long distance. Some other possibilities are to look for cluster groupings that agree with existing or expected structures, or to replicate the analysis on subsets of the data to see if the structures emerge consistently.
Assessing model fit
Sample data are used for this analysis.
We use Ward's method to illustrate hierarchical clustering.
Useful information is contained in the "Agglomeration Schedule".
Case Processing Summary
The Agglomeration Schedule shows the cases or clusters being combined at each stage.
The first line represents stage 1, with 19 clusters.
Respondents 14 and 16 are combined at this stage, as shown in the columns labelled "clusters combined".
The column "coefficients" represents the squared Euclidean distance between these two respondents.
The column "Stage Cluster First Appears" indicates the stage at which a cluster was first formed. For example, an entry of 1 at stage 6 indicates that respondent 14 was first grouped at stage 1.
The column "Next Stage" indicates the stage at which another respondent or cluster is combined with this one. Since the number in the first line of the last column is 6, we see that at stage 6 respondent 10 is combined with respondents 14 and 16 to form a single cluster.
Similarly, the second line represents stage 2, with 18 clusters. At stage 2, respondents 6 and 7 are grouped together.
The dendrogram is read from left to right.
Vertical lines represent clusters that are joined together.
The position of the line on the scale indicates the distances at which clusters were joined.
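The way an agglomeration schedule is built up can be illustrated with a minimal sketch in Python. It is not the SPSS procedure: the function name `ward_schedule`, the five data points and the choice to report the coefficient as the cumulative within-cluster sum of squares are all assumptions made for illustration, and statistical packages may scale the coefficient differently.

```python
from itertools import combinations

def ward_schedule(points):
    """Naive Ward agglomeration on a small data set.

    Returns one tuple per stage:
    (cluster_a, cluster_b, cumulative_sse, clusters_remaining).
    Clusters are named after their lowest case number (cases are 1-based).
    """
    clusters = {i + 1: [i + 1] for i in range(len(points))}

    def centroid(members):
        dims = range(len(points[0]))
        return [sum(points[m - 1][d] for m in members) / len(members) for d in dims]

    def merge_cost(a, b):
        # Increase in the within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(clusters[a]), centroid(clusters[b])
        squared = sum((x - y) ** 2 for x, y in zip(ca, cb))
        na, nb = len(clusters[a]), len(clusters[b])
        return na * nb / (na + nb) * squared

    schedule, total = [], 0.0
    while len(clusters) > 1:
        # Pick the pair whose merge adds the least to the sum of squares.
        a, b = min(combinations(sorted(clusters), 2), key=lambda pair: merge_cost(*pair))
        total += merge_cost(a, b)
        schedule.append((a, b, round(total, 3), len(clusters) - 1))
        clusters[a] = clusters[a] + clusters.pop(b)
    return schedule

# Hypothetical data: five respondents measured on two variables.
sample = [(1, 1), (1, 2), (5, 5), (5, 6), (9, 1)]
for stage, row in enumerate(ward_schedule(sample), start=1):
    print(stage, row)
```

Each printed row plays the role of one line of the schedule: the two clusters combined, the coefficient after the merge, and the number of clusters left.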
Decide on the number of clusters
In hierarchical clustering, the distance at which clusters are combined can be used as a criterion for deciding the number of clusters.
We can see from the "Agglomeration Schedule" that the value in the coefficients column suddenly more than doubles between stage 17 (3 clusters) and stage 18 (2 clusters).
Likewise, if we look at the dendrogram, we can see that at its last two stages the clusters are being combined at large distances. Therefore, it appears that a three-cluster solution is appropriate.
By simply counting the frequency of cluster membership, we see that a three-cluster solution results in clusters with eight, six, and six elements. However, if we go to a four-cluster solution, the sizes of the clusters are eight, six, five, and one.
Since it is not meaningful to have a cluster with only one case (respondent), a three-cluster solution is preferable in this case.
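The "sudden jump" rule described above can be sketched in a few lines of Python. The coefficient values below are hypothetical (they are not the figures from the sample data), and the function name `suggest_cluster_count` is our own; the sketch simply looks for the largest ratio between consecutive coefficients and keeps the solution that exists just before that merge.

```python
def suggest_cluster_count(coefficients):
    """Return (number_of_clusters, ratio) for the largest relative jump.

    coefficients[i] is the coefficient reported at stage i + 1, so a data set
    with n cases has n - 1 stages and n - s clusters remain after stage s.
    """
    n_cases = len(coefficients) + 1
    best_stage, best_ratio = 1, 0.0
    for stage in range(1, len(coefficients)):  # stage = 1-based stage before the jump
        before, after = coefficients[stage - 1], coefficients[stage]
        ratio = after / before if before > 0 else float("inf")
        if ratio > best_ratio:
            best_stage, best_ratio = stage, ratio
    return n_cases - best_stage, best_ratio

# Hypothetical schedule for 20 respondents (19 stages).
coefficients = [10, 12, 14, 16, 18, 20, 23, 26, 29, 32,
                36, 40, 44, 49, 54, 60, 70, 150, 320]
clusters, ratio = suggest_cluster_count(coefficients)
print(clusters, round(ratio, 2))
```

With these made-up values the largest jump lies between stage 17 and stage 18, so the sketch suggests three clusters. The rule is only a heuristic, and the cluster sizes should still be checked as described above.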
Validating Cluster Solutions
Validating cluster solutions is an important step before using the results for prediction. It can be described as a process in which the results of cluster analysis are verified in a quantitative and objective way. The technique can be broadly divided into the following four steps:
1. Determining data structure: identifying non-randomness
2. Finding out the number of clusters
3. Internal validation: This step determines whether the clustering solution is the best option for the data to be analysed. Internal validation techniques do not require additional knowledge in the form of class labels, but are based on the intrinsic information of the data.
Type 1: Compactness: These measures assess cluster compactness or homogeneity using intra-cluster variance, for example the sum-of-squared-errors (minimum variance) criterion that the k-means algorithm optimizes.
Type 2: Connectedness: These measures evaluate how well the partitioning respects local densities, that is, whether data items are grouped together with their nearest neighbours in the data space.
Type 3: Separation: This type includes all measures that quantify the degree of separation between individual clusters.
Type 4: Combinations: These measures combine types one and three. Combining them is useful because the two classes measure opposing trends: intra-cluster homogeneity improves as the number of clusters increases, while the distance between clusters tends to deteriorate.
Type 5: Predictive power/stability: This is a special class of internal validation measures because they require access to the clustering algorithm itself. The key feature of this technique is that it repeatedly perturbs the original data and re-clusters the resulting data.
Type 6: Compliance between partitioning and distance information: These measures directly assess the degree to which distance information in the original data is preserved in a partitioning. The cophenetic correlation is most commonly used for this. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodelled data points. This measure is widely used in biostatistics (cluster-based models of DNA sequences). It can also be used in other application areas in which raw data occur in clusters.
Type 7: Specialized measures for highly correlated data: This type comprises all the measures that exploit redundancy and correlation in the data.
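The opposing trends of compactness (type one) and separation (type three) can be seen in a small Python sketch. The four data points and both partitions are hypothetical, and the helper names are our own; the sum of squared errors stands in for compactness and the smallest distance between centroids stands in for separation.

```python
from itertools import combinations
from math import dist

def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length tuples."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def sum_squared_error(clusters):
    """Compactness: total squared distance of every point to its own centroid."""
    return sum(dist(p, centroid(c)) ** 2 for c in clusters for p in c)

def min_centroid_separation(clusters):
    """Separation: smallest distance between any two cluster centroids."""
    centres = [centroid(c) for c in clusters]
    return min(dist(a, b) for a, b in combinations(centres, 2))

# Hypothetical data: the same four points partitioned in two ways.
two_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
four_clusters = [[(0, 0)], [(0, 2)], [(10, 0)], [(10, 2)]]

for name, partition in (("k=2", two_clusters), ("k=4", four_clusters)):
    print(name, sum_squared_error(partition), min_centroid_separation(partition))
```

Moving from two to four clusters lowers the sum of squared errors but also shrinks the separation, which is why a combined measure is needed to compare solutions with different numbers of clusters.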
4. External validation: This step determines how well the clustering solution suits the data. It comprises methods that evaluate a clustering result based on knowledge of the correct class labels. This step allows an objective evaluation and comparison of clustering algorithms on the data provided. External validation can be broadly classified into the following two types:
Type 1: Unary measures: These compare the cluster assignments with a set of standard labels (the true values) to determine the degree of consensus between the two.
Type 2: Binary measures: These use indices that evaluate the consensus between a partitioning and the true values. Various indices are employed for validation; some important ones are listed here. Note that the Dunn, Davies-Bouldin and C indices are computed from distances alone and need no class labels, whereas the Rand and Jaccard indices compare a partitioning with known labels.
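Dunn's Validity Index: This index rewards partitions with large inter-cluster distances and small intra-cluster distances, so a larger value indicates a better clustering:
D = min over all i, j with i ≠ j of [ d(ci,cj) / max over all k of d'(ck) ]
d(ci,cj): distance between clusters ci and cj (inter-cluster distance)
d'(ck): intra-cluster distance of cluster ck
n: number of clusters (i, j and k run from 1 to n)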
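A minimal Python sketch of Dunn's index follows. The source does not say how the inter-cluster and intra-cluster distances are measured, so the sketch assumes one common choice: the inter-cluster distance is the smallest distance between points of two clusters, and the intra-cluster distance is the cluster diameter (its largest pairwise distance). The function name and data are hypothetical.

```python
from itertools import combinations
from math import dist

def dunn_index(clusters):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    # Assumption: d(ci, cj) is the closest pair of points across two clusters.
    min_between = min(
        min(dist(p, q) for p in a for q in b)
        for a, b in combinations(clusters, 2)
    )
    # Assumption: d'(ck) is the diameter; a one-point cluster has diameter 0.
    max_within = max(
        max((dist(p, q) for p, q in combinations(c, 2)), default=0.0)
        for c in clusters
    )
    return min_between / max_within

# Hypothetical data: two tight clusters that lie far apart.
clusters = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
print(dunn_index(clusters))
```

The sketch raises a ZeroDivisionError if every cluster contains a single point, since all diameters are then zero.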
Davies-Bouldin Validity Index: This index is based on the ratio of the sum of within-cluster scatter to between-cluster separation. The ratio is small if the clusters are compact and far from each other. Consequently, the Davies-Bouldin index has a small value for a good clustering.
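The ratio described above can be sketched in Python as follows. The sketch assumes the usual formulation: scatter is the mean distance of a cluster's points to its centroid, separation is the distance between centroids, and the index averages, over all clusters, the worst (largest) ratio each cluster has with any other cluster. The function names and data are hypothetical.

```python
from math import dist

def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length tuples."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def davies_bouldin(clusters):
    """Average over clusters of the worst scatter-to-separation ratio."""
    centres = [centroid(c) for c in clusters]
    # Within-cluster scatter: mean distance of the points to their centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centres)]
    n = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / dist(centres[i], centres[j])
            for j in range(n) if j != i)
        for i in range(n)
    ]
    return sum(worst) / n

# Hypothetical data: compact clusters that lie far apart give a small value.
clusters = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
print(davies_bouldin(clusters))
```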
C index: C = (S - Smin) / (Smax - Smin)
Here S is the sum of distances over all pairs of patterns from the same cluster. Let l be the number of those pairs. Then Smin is the sum of the l smallest distances if all pairs of patterns are considered, whether or not they belong to the same cluster. Similarly, Smax is the sum of the l largest distances out of all pairs. Hence a small value of C indicates a good clustering.
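The C index can be computed with a short Python sketch. It assumes the usual form C = (S - Smin) / (Smax - Smin), where Smin and Smax are taken over as many pairs as there are within-cluster pairs; the function name `c_index` and the data are hypothetical.

```python
from itertools import combinations
from math import dist

def c_index(points, labels):
    """C index of a partition; values near 0 indicate a good clustering."""
    pairs = list(combinations(range(len(points)), 2))
    all_distances = sorted(dist(points[i], points[j]) for i, j in pairs)
    within = [dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    count = len(within)  # number of within-cluster pairs
    if count == 0:
        raise ValueError("no within-cluster pairs")
    s = sum(within)
    s_min = sum(all_distances[:count])   # the smallest distances overall
    s_max = sum(all_distances[-count:])  # the largest distances overall
    if s_max == s_min:
        raise ValueError("all distances are equal; the index is undefined")
    return (s - s_min) / (s_max - s_min)

# Hypothetical data: four points and two alternative partitions.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(c_index(points, [0, 0, 1, 1]))  # neighbours grouped together
print(c_index(points, [0, 1, 0, 1]))  # distant points grouped together
```

The first partition groups the nearest neighbours and so gives a value at the bottom of the range, while the second groups distant points and gives a value close to 1.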
Rand index: This index measures the number of pairwise agreements between a clustering K and a set of class labels C, normalised so that the value lies between 0 and 1:
R = (a + d) / (a + b + c + d)
a: number of pairs of points with the same label in C and assigned to the same cluster in K
b: number of pairs with the same label, but in different clusters
c: number of pairs in the same cluster, but with different class labels
d: number of pairs with a different label in C that were assigned to a different cluster in K
A high value for this measure generally indicates a high level of agreement between a clustering and the annotated natural classes.
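The pair counts a, b, c and d can be obtained by walking over every pair of points, as in the following Python sketch. It assumes the usual form of the index, R = (a + d) / (a + b + c + d); the function names and the label vectors are hypothetical.

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count the pairs of points in each of the four agreement categories."""
    a = b = c = d = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1      # same label, same cluster
        elif same_class:
            b += 1      # same label, different clusters
        elif same_cluster:
            c += 1      # different labels, same cluster
        else:
            d += 1      # different labels, different clusters
    return a, b, c, d

def rand_index(classes, clusters):
    """Share of pairs on which the labels and the clustering agree."""
    a, b, c, d = pair_counts(classes, clusters)
    return (a + d) / (a + b + c + d)

# Hypothetical example: true labels C and a clustering K of four points.
C = [0, 0, 1, 1]
K = [0, 0, 1, 2]
print(pair_counts(C, K), rand_index(C, K))
```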
Jaccard index: It is used to evaluate the similarity between different partitions of the same data set:
J = a / (a + b + c)
a: denotes the number of pairs of points with the same label in C and assigned to the same cluster in K
b: denotes the number of pairs with the same label, but in different clusters
c: denotes the number of pairs in the same cluster, but with different class labels.
The value of J lies in between 0 and 1. A value of 1.0 indicates that C and K are identical.
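A minimal Python sketch of the Jaccard index follows, assuming the usual form J = a / (a + b + c). Unlike the Rand index, pairs that are separated in both C and K (the count d) are ignored. The function name and the label vectors are hypothetical.

```python
from itertools import combinations

def jaccard_index(classes, clusters):
    """J = a / (a + b + c), computed over all pairs of points."""
    a = b = c = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1      # same label, same cluster
        elif same_class:
            b += 1      # same label, different clusters
        elif same_cluster:
            c += 1      # different labels, same cluster
    if a + b + c == 0:
        raise ValueError("no pair shares a label or a cluster")
    return a / (a + b + c)

# Hypothetical example: true labels C and a clustering K of four points.
C = [0, 0, 1, 1]
K = [0, 0, 1, 2]
print(jaccard_index(C, K))
print(jaccard_index(C, C))  # identical partitions
```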
An Indian case on Cluster analysis: Arranging Stock in a Warehouse
The case we analysed is that of 'Greenpark Apollo', a pharma company located in Green Park. The corporate strategy manager wanted to align the company's R&D and production efforts towards mixed drugs which could cure more than one psychiatric disorder.
In the past the pharmacy had come up with such drugs but the revenue generated was far less than the expected value. Apollo wants to identify which cluster of disorders is more related to each other. This would help them in producing mixed drugs which would lower their production cost and at the same time target a larger number of patients.
A market research study was conducted with questions based on the Spielberger Trait Anxiety Inventory (STAI), the Beck Depression Inventory (BDI), a measure of Intrusive Thoughts and Rumination (IT) and a measure of Impulsive Thoughts and Actions (Impulse). People with the same disorder are expected to show similar values on these measures.
This was the first step of internal validation and was conducted by experienced psychologists.
The table below presents the data obtained from the market research study conducted by the trained psychologists. Note that the data for each variable are placed in a separate column, so each row represents a single subject's data.
Table 1: Data of Diagnosis
Hierarchical cluster analysis using Ward's method was applied to the above data set. The following dendrogram was obtained as the key output.
Agglomerative cluster analysis always works upwards until every case is placed in a single cluster. Reading back from that final merge, the first division of the data separates cases (1, 4, 7, 11, 13, 10, 12, 9, 15 and 2) from cases (5, 14, 6, 8 and 3). This division clearly separates the patients with OCD from those with depression and GAD, probably because patients with GAD and depression have low scores on intrusive thoughts. Further splitting produces two more clusters, one belonging to patients with GAD and the other to patients with depression.
Managerial implications of the results
A detailed look at the cluster analysis can help Apollo's manager take strategic decisions on the research, development and production of drugs which can counter various symptoms at the same time. From the above cluster analysis we can see that scores on the Spielberger Trait Anxiety Inventory are high in patients with GAD and OCD. A drug targeting STAI together with GAD/OCD would have higher demand than one targeting only one of STAI, GAD or OCD. Such a drug would be economical for the consumer and would also provide the company with a competitive edge.
BDI shows high values for depression. It makes strategic sense to make two drugs, one targeting a combination of BDI and depression and one targeting BDI and STAI. On similar lines, Intrusive Thoughts and Rumination occur most commonly with OCD, and a combination drug would make more economic sense. Impulse has low values for the independent variables, so it does not make much sense to use it in a combination drug. A drug for it can be produced individually and prescribed to patients who suffer from it specifically.
Data Analysis in SPSS
SPSS (Statistical Package for the Social Sciences) was developed by Norman H. Nie and C. Hadlai Hull, and its first version was released in 1968.
There are two ways of programming in SPSS: via the pull-down menus and via the syntax language. Programming in the syntax language offers reproducibility, simplifies repetitive tasks and handles complex data manipulations and analyses. Some complex applications can only be programmed in syntax and are not accessible through the pull-down menus. In the interactive (menu) mode, SPSS writes syntax code reflecting the menu choices and saves it in a journal file. The journal file can be viewed and edited from the syntax window.
There are a few disadvantages of using SPSS. Firstly, SPSS offers less control over statistical output than other software such as SAS or Stata. Secondly, SPSS has problems with certain types of data manipulation; for example, its lag functions, which transform data across cases, are weak.
Prior to the analysis of data, some steps need to be performed to make data manipulation simpler. These steps include:
Reading Data: translating raw data, or data in another form, into SPSS
Transforming Data: creating new variables or changing the values of existing variables
Defining Variables: labelling data to make it more programmer-friendly, and structuring it so that it is readable by SPSS
After the above steps are performed the data can then be manipulated using various statistical tools to obtain relevant analysis.
Certain descriptive statistics are used for examining the data for the analyses. Some of the most useful ones include - mean, median, mode, frequency, quartiles, sum, variance, standard deviation, minimum/maximum and range.
The user can transform a variable to make its relation with other variables as desired. For example, a variable can be transformed to make the distribution more normal, to make its relationship to another variable more linear or to transform its value to the average of the data set.
Recode is performed when a transformation involves categorical variables, for example when a distinction based on a median split is to be made. Recode is also useful when the value of a categorical variable needs to be modified or when some of the categories of an existing categorical variable need to be combined. When performing Recode there is a choice between the same variable and a different variable. The same-variable option replaces the original values of the existing variable, while the different-variable option creates a new variable holding the new values. Because it preserves the original data, the different-variable option is the preferred mode.
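The idea of recoding into a different variable can be illustrated outside SPSS with a short Python sketch of a median split. This is an analogue, not SPSS syntax: the function name, the category names "low" and "high", the rule that values equal to the median fall in the low group, and the scores are all assumptions made for illustration.

```python
from statistics import median

def recode_median_split(values, low="low", high="high"):
    """Return a new list of categories; the input list is left unchanged."""
    cut = median(values)
    # Assumption: values equal to the median are placed in the low group.
    return [high if value > cut else low for value in values]

# Hypothetical scores for five respondents.
scores = [12, 7, 25, 18, 9]
score_group = recode_median_split(scores)  # the "different variable"
print(scores)
print(score_group)
```

Because the result is stored under a new name, the original scores remain available if the split later needs to be redone with a different cut-off.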
This is a common technique used for analysis of surveys. Certain questions/parameters are given high preference values, mainly to draw the respondents' attention. During analyses these values are brought back to normal and then utilised for analyses.
An analyst can use only a subset of the data file for analysis. Complex selections can be built using the Boolean operator AND (represented by the symbol &) or the Boolean operator OR (represented by the symbol |). For example, in a data set classified by seniority (senior, junior) and sex (male, female), the '&' operator allows the selection of the "senior female" subset.
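The logic of such a selection can be sketched in Python, where the keywords `and` and `or` play the roles of & and |. This is an analogue rather than SPSS syntax, and the respondent records and field names below are hypothetical.

```python
# Hypothetical data file: one record per respondent.
respondents = [
    {"id": 1, "level": "senior", "sex": "female"},
    {"id": 2, "level": "senior", "sex": "male"},
    {"id": 3, "level": "junior", "sex": "female"},
    {"id": 4, "level": "junior", "sex": "male"},
]

# AND: both conditions must hold (the "senior female" subset).
senior_female = [r for r in respondents
                 if r["level"] == "senior" and r["sex"] == "female"]

# OR: at least one condition must hold.
senior_or_female = [r for r in respondents
                    if r["level"] == "senior" or r["sex"] == "female"]

print([r["id"] for r in senior_female])
print([r["id"] for r in senior_or_female])
```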
Many tools are available in SPSS to aid faster statistical data analysis. Some of these include:
t Tests: One-sample t-test, Independent-samples t-test, Paired-samples t-test
Analysis of Variance (ANOVA): One-way within-subjects ANOVA, Multifactor within-subjects ANOVA, Mixed ANOVA
Correlation: Pearson correlation, Point-biserial correlation, Spearman rank correlation
Regression: Simple linear regression, Multiple regression, Multiple regression with interactions, Polynomial regression
Chi-Square Test of Independence
Factor Analysis: SPSS covers only exploratory factor analysis (EFA) and not confirmatory factor analysis (CFA).