In this paper we analyze information about the customers of a bank, clustering them into three clusters using the SPSS TwoStep Cluster method. This method suits our case study well because, compared to other classical clustering methods, TwoStep handles mixed data (both continuous and categorical variables) and also finds the optimal number of clusters. TwoStep creates three customer profiles. The largest group contains skilled customers, whose purpose for the loan is education or business. The second group contains persons with real estate, but mostly unemployed, who asked for a credit for retraining or for household goods. The third profile gathers people with unknown properties, who request credit for a car or a television and then for education. The benefit of the study is reinforcing the company's profits by managing its clients more effectively.
Key words: TwoStep Cluster, clustering, pre-clustering, CF tree
JEL Classification: C63, C46, C19
Applications that use clustering algorithms come from various fields. However, most of these algorithms work with either numerical data or categorical data, while real-world data contain both numerical and categorical attributes. TwoStep Cluster is an SPSS method which solves this problem.
In the present paper we intend to identify the bank customers' profiles, starting from a public dataset provided by a German bank and using TwoStep Cluster. This method has the advantage of determining the proper number of clusters, so the aim is to find this number of profiles, in order to manage clients and prospective clients effectively.
Next we present TwoStep Cluster and our case study with inputs, outputs and the interpretation of the results.
Grouping data (or clustering data) is a method that forms classes of objects with similar characteristics. Clustering is often confused with classification, but there is a major difference between them: in classification, the objects are assigned to predefined classes, whereas in clustering those classes must be defined as well.
Clustering techniques are used when we expect the data to group together naturally in various categories. The clusters are categories of items with many features in common, for instance, customers, events etc. If the problem is complex, before clustering data, other data mining techniques can be applied (such as neural networks or decision trees).
Classical methods of clustering use hierarchical or partitioning algorithms. The hierarchical algorithms form the clusters successively, on the basis of clusters established before, while the partitioning algorithms determine all the clusters at the same time, building different partitions and then evaluating them in relation to certain criteria.
In SPSS, clustering analysis can be performed using TwoStep Cluster, Hierarchical Cluster or K-Means Cluster, each of them relying on a different algorithm to create the clusters. The last two are classical methods of classification, based on hierarchical and partitioning algorithms, respectively, while the TwoStep method is specially designed and implemented in SPSS.
In terms of types of data considered for application, Hierarchical Cluster is limited to small datasets, K-Means is restricted to continuous values and TwoStep can create cluster models based on both continuous and categorical variables.
Next, we approach TwoStep method, highlighting the advantages of this method in the field discussed.
The TwoStep Cluster Analysis
TwoStep Cluster is an algorithm primarily designed to analyze large datasets. The algorithm groups the observations into clusters using a proximity criterion, and the procedure employs an agglomerative hierarchical clustering method. Compared to classical methods of cluster analysis, TwoStep handles both continuous and categorical attributes. Also, the method can automatically determine the optimal number of clusters.
TwoStep Cluster involves performing the following steps:
pre-clustering the records into many small sub-clusters;
solving atypical values (outliers) - optional;
clustering the sub-clusters into the final clusters.
In the pre-clustering step, the algorithm scans the data records one by one and decides whether the current record can be added to one of the previously formed clusters or starts a new cluster, based on a distance criterion. The method uses two types of distance measure: the Euclidean distance and the log-likelihood distance.
The pre-clustering procedure is implemented by building a data structure called the CF (cluster feature) tree, which contains the cluster centers. The CF tree consists of levels of nodes, each node having a number of entries. A leaf entry is a final sub-cluster. For each record, starting from the root node, the nearest child node is found recursively, descending along the CF tree. Once a leaf node is reached, the algorithm finds the nearest leaf entry in the leaf node. If the record is within a threshold distance of the nearest leaf entry, the record is absorbed into that leaf entry and the CF tree is updated. Otherwise, a new leaf entry is created. If there is not enough space in the leaf node to add another entry, the leaf node is split in two, using the farthest pair of entries as seeds and redistributing the remaining entries based on the closeness criterion.
In the process of building the CF tree, the algorithm implements an optional step that allows handling atypical values (outliers). Outliers are considered to be records that do not fit well into any cluster. In SPSS, the records in a leaf are considered outliers if their number is less than a certain percentage of the size of the largest leaf entry in the CF tree; by default, this percentage is 25%. Before rebuilding the CF tree, the procedure searches for potential atypical values and puts them aside. After the CF tree is rebuilt, the procedure checks whether these values can fit in the tree without increasing its size. Finally, the values that do not fit anywhere are considered outliers.
If the CF tree exceeds the allowed maximum size, it is rebuilt based on the existing CF tree, by increasing the threshold distance. The new CF tree is smaller and allows new input records.
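The pre-clustering pass described above can be sketched as follows. This is a minimal illustration, not SPSS code: leaf entries are kept as cluster features (count, linear sum, sum of squares) in a flat list with a fixed threshold, which simplifies away the hierarchical CF tree, the node splitting and the tree rebuilding.

```python
import math

class CFEntry:
    """A leaf entry (sub-cluster) summarized as a cluster feature."""
    def __init__(self, record):
        self.n = 1                          # number of records absorbed
        self.ls = list(record)              # linear sum per variable
        self.ss = [x * x for x in record]   # squared sum per variable

    def center(self):
        return [s / self.n for s in self.ls]

    def add(self, record):
        self.n += 1
        for k, x in enumerate(record):
            self.ls[k] += x
            self.ss[k] += x * x

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pre_cluster(records, threshold):
    """Absorb each record into the nearest entry, or start a new one."""
    entries = []
    for rec in records:
        if entries:
            nearest = min(entries, key=lambda e: euclidean(e.center(), rec))
            if euclidean(nearest.center(), rec) <= threshold:
                nearest.add(rec)
                continue
        entries.append(CFEntry(rec))
    return entries

# Two tight groups of 2-D records collapse into two sub-clusters.
data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
subs = pre_cluster(data, threshold=1.0)
print(len(subs))  # 2 sub-clusters
```

Raising the threshold, as the rebuilding step does, makes entries absorb more records and so shrinks the structure.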
The clustering stage takes as input the sub-clusters resulting from the pre-clustering step (without the noises, if the optional step was used) and groups them into the desired number of clusters. Because the number of sub-clusters is much smaller than the number of initial records, classical clustering methods can be used successfully. TwoStep uses an agglomerative hierarchical method, which determines the number of clusters automatically.
The hierarchical clustering method refers to the process by which clusters are repeatedly merged until a single cluster groups all the records. The process starts by defining an initial cluster for each sub-cluster. Then, all clusters are compared and the pair of clusters with the smallest distance between them is merged into one cluster. The process repeats with the new set of clusters until all clusters have been merged. Thus, it is quite simple to compare solutions with different numbers of clusters.
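The agglomerative step above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the Euclidean distance between cluster means is used (rather than the log-likelihood distance), and every intermediate solution is recorded so that solutions with different numbers of clusters can be compared afterwards.

```python
def center(cluster):
    """Mean vector of a cluster (a list of points)."""
    dim = len(cluster[0])
    return [sum(p[k] for p in cluster) / len(cluster) for k in range(dim)]

def dist(a, b):
    ca, cb = center(a), center(b)
    return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5

def agglomerate(sub_clusters):
    """Repeatedly merge the closest pair; keep every intermediate solution."""
    clusters = [list(c) for c in sub_clusters]
    solutions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
        solutions.append([list(c) for c in clusters])
    return solutions  # solutions[k] holds the (len(sub_clusters) - k)-cluster solution

subs = [[(0.0, 0.0)], [(0.2, 0.0)], [(5.0, 5.0)], [(5.2, 5.0)]]
sols = agglomerate(subs)
print([len(s) for s in sols])  # [4, 3, 2, 1]
```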
To calculate the distance between clusters, either the Euclidean distance or the log-likelihood distance can be used.
The Euclidean distance can be used only if all variables are continuous. The Euclidean distance between two points is defined as the square root of the sum of the squares of the differences between the coordinates of the points. For clusters, the distance between two clusters is defined as the Euclidean distance between their centers, where a cluster center is defined as the vector of the cluster means of each variable.
The log-likelihood distance can be used for both continuous and categorical variables. The distance between two clusters is related to the decrease of the natural logarithm of the likelihood function as they are merged into one cluster. To calculate the log-likelihood distance, it is assumed that the continuous variables have normal distributions, the categorical variables have multinomial distributions, and the variables are independent of each other.
The distance between clusters i and j is defined as:

$d(i,j) = \xi_i + \xi_j - \xi_{\langle i,j \rangle}$ (1)

where, for a cluster s,

$\xi_s = -N_s \left( \sum_{k=1}^{K^A} \frac{1}{2} \log\left( \hat\sigma_k^2 + \hat\sigma_{sk}^2 \right) + \sum_{k=1}^{K^B} \hat{E}_{sk} \right)$ (2)

and, in equation (2),

$\hat{E}_{sk} = -\sum_{l=1}^{L_k} \frac{N_{skl}}{N_s} \log \frac{N_{skl}}{N_s}$ (3)

Here $d(i,j)$ is the distance between clusters i and j; $\langle i,j \rangle$ is the index that represents the cluster formed by combining clusters i and j; $K^A$ is the total number of continuous variables; $K^B$ is the total number of categorical variables; $L_k$ is the number of categories for the k-th categorical variable; $N_s$ is the total number of data records in cluster s; $N_{skl}$ is the number of records in cluster s whose categorical variable k takes the l-th category; $N_{kl}$ is the number of records whose categorical variable k takes the l-th category; $\hat\sigma_k^2$ is the estimated variance (dispersion) of the continuous variable k for the entire dataset; $\hat\sigma_{sk}^2$ is the estimated variance of the continuous variable k in cluster s.
To determine the number of clusters automatically, the method uses two stages. In the first stage, the BIC (Schwarz's Bayesian Information Criterion) or AIC (Akaike's Information Criterion) indicator is calculated for each number of clusters in a specified range, and this indicator is used to find an initial estimate of the number of clusters. In the second stage, the initial estimate is refined by examining the ratio of distance measures between adjacent cluster solutions.
For J clusters, the two indicators are computed according to equations (4) and (5):

$BIC(J) = -2 \sum_{j=1}^{J} \xi_j + m_J \log N$ (4)

$AIC(J) = -2 \sum_{j=1}^{J} \xi_j + 2 m_J$ (5)

where $m_J = J \left( 2 K^A + \sum_{k=1}^{K^B} (L_k - 1) \right)$ is the number of independent parameters and N is the total number of records.
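As an illustration of how equations (4) and (5) score a candidate solution, the following sketch computes BIC and AIC from the xi term of each cluster. The xi values, the variable counts and the category counts are invented for illustration, not taken from the paper's dataset.

```python
import math

def n_params(J, KA, Ls):
    """Number of independent parameters for a J-cluster solution.
    KA: number of continuous variables; Ls: categories per categorical variable."""
    return J * (2 * KA + sum(L - 1 for L in Ls))

def bic(xis, n_records, KA, Ls):
    m = n_params(len(xis), KA, Ls)
    return -2 * sum(xis) + m * math.log(n_records)

def aic(xis, n_records, KA, Ls):
    m = n_params(len(xis), KA, Ls)
    return -2 * sum(xis) + 2 * m

# illustrative xi values for a 2-cluster and a 3-cluster solution of 1000 records
print(bic([-420.0, -390.0], 1000, KA=7, Ls=[4, 5, 3]))
print(aic([-420.0, -390.0, -150.0], 1000, KA=7, Ls=[4, 5, 3]))
```

A lower indicator value marks a better solution; BIC penalizes extra clusters more heavily than AIC as the dataset grows.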
The relative contribution of variables to form the clusters is computed for both types of variables (continuous and categorical).
For the continuous variables, the importance measure is based on the Student statistic:

$t = \frac{\bar{x}_{jk} - \bar{x}_k}{\hat\sigma_{jk} / \sqrt{N_j}}$

where $\bar{x}_k$ is the estimated mean of the continuous variable k for the entire dataset and $\bar{x}_{jk}$ is the estimated mean of the continuous variable k in cluster j.

Under H0 (the null hypothesis), the importance measure has a Student distribution with $N_j - 1$ degrees of freedom. The significance level is two-tailed.
For the categorical variables, the importance measure is based on the $\chi^2$ test:

$\chi^2 = \sum_{l=1}^{L_k} \frac{\left( N_{jkl} - N_j N_{kl} / N \right)^2}{N_j N_{kl} / N}$

which, under the null hypothesis, is distributed as a $\chi^2$ with $L_k - 1$ degrees of freedom.
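The two importance statistics can be sketched as follows. This is my own minimal implementation of the idea (comparing one cluster against the whole dataset), not SPSS code, and the sample ages, housing counts and overall proportions are invented for illustration.

```python
import math

def t_statistic(cluster_vals, overall_mean):
    """Student-t importance of a continuous variable in one cluster."""
    n = len(cluster_vals)
    mean = sum(cluster_vals) / n
    s2 = sum((v - mean) ** 2 for v in cluster_vals) / (n - 1)  # sample variance
    return (mean - overall_mean) / math.sqrt(s2 / n)           # df = n - 1

def chi2_statistic(cluster_counts, overall_props):
    """Chi-square importance of a categorical variable in one cluster."""
    n = sum(cluster_counts.values())
    stat = 0.0
    for cat, p in overall_props.items():
        expected = n * p                       # count expected under H0
        observed = cluster_counts.get(cat, 0)
        stat += (observed - expected) ** 2 / expected
    return stat                                # df = (number of categories) - 1

ages = [30, 32, 31, 29, 33]          # ages inside one cluster
print(t_statistic(ages, overall_mean=40.0))

counts = {"own": 40, "rent": 10}     # housing counts inside one cluster
props = {"own": 0.5, "rent": 0.5}    # overall housing proportions
print(chi2_statistic(counts, props))
```

The larger the statistic in absolute value, the more the variable differentiates that cluster from the overall population.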
Regarding the cluster membership of the items, the records are allocated according to the specifications for resolving atypical values (the noises) and the options chosen for measuring distances.
If the option of solving the atypical values is not used, the values are assigned to the nearest cluster, according to the method of distance measuring. Otherwise, the values are treated differently, as follows:
in the case of the Euclidean method, an item is assigned to the nearest cluster if the distance between them is smaller than a critical value; otherwise, the item is declared as noise (outlier).
If the log-likelihood method is chosen, the procedure assumes that the noises follow a uniform distribution and computes both the log-likelihood corresponding to assigning an item to a noise cluster and that resulting from assigning it to the nearest non-noise cluster. Then, the item is assigned to the cluster that obtained the highest value of the logarithm. This is equivalent to assigning an item to the nearest cluster if the distance between them is smaller than a critical value; otherwise, the item is designated as noise.
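The membership assignment with noise handling reduces to the following sketch: an item goes to its nearest cluster unless even the nearest one is farther than a critical value, in which case the item is labeled as noise. The distance function, the critical value and the cluster centers below are placeholders for illustration, not the measure or values used in the analysis.

```python
def assign(item, centers, dist, critical):
    """Return the index of the nearest cluster, or None for noise."""
    best = min(range(len(centers)), key=lambda j: dist(item, centers[j]))
    if dist(item, centers[best]) <= critical:
        return best          # index of the nearest cluster
    return None              # too far from every cluster: noise / outlier

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

centers = [(0.0, 0.0), (5.0, 5.0)]
print(assign((0.3, 0.1), centers, euclid, critical=1.0))  # 0
print(assign((9.0, 9.0), centers, euclid, critical=1.0))  # None
```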
In conclusion, an important advantage of the method is that it handles mixed data (both continuous and categorical). Another advantage is that, although the TwoStep method works with large datasets, it needs a shorter processing time than other methods. As a disadvantage, the TwoStep method does not allow missing values, and items that have missing values are excluded from the analysis.
Since TwoStep Cluster is often preferred, first, for large datasets and, second, for handling mixed data, we applied this method to public data referring to the clients of a bank, in order to cluster these data. (Some of these data were also used in another application, to reduce dimensionality by applying PCA - Principal Component Analysis.) The input and the output of this method are presented further below.
A related paper presents a customer-management study using the same method that we use in this paper. The authors propose a policy for consolidating a company's profits by selecting clients through the cluster analysis method of CRM (Customer Relationship Management), thus managing resources better. For the design of a new service policy, they analyze the level of contribution of the clients' service pattern (total number of visits to the homepage, service type, service usage period, total payment, average service period, service charge per homepage visit and profits) through the cluster analysis of the clients' data. The clients were grouped into four clusters according to their contribution level in terms of profits.
The dataset used for our case study was obtained from a public database that contains the credit data of a German bank. The dataset has 1000 records and is presented as a table in SPSS. This table contains information about the duration of the credit, credit history, purpose of the loan, credit amount, savings account, years employed, payment rate, personal status, residency, property, age, housing, number of credits at the bank, job, dependents and credit approval. In Table 1 we present some of these data.
Table 1. Source data
The database contains 9 categorical variables and 7 continuous variables. The continuous variables are standardized by default. Because we use mixed data, only the log-likelihood option is available for the distance measure.
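The default standardization of the continuous variables can be sketched as a generic z-score transformation (an illustration, not SPSS code, and the credit amounts below are invented): each value is rescaled to (x - mean) / standard deviation, so that variables measured on very different scales, such as age and credit amount, contribute comparably to the distance.

```python
def standardize(values):
    """Rescale a list of values to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

amounts = [1000.0, 2000.0, 3000.0, 4000.0]   # illustrative credit amounts
z = standardize(amounts)
print([round(x, 3) for x in z])
```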
In the first run, we chose BIC to determine the number of clusters, though we could override this and specify a fixed number. The results obtained using AIC do not differ from those obtained with BIC, so below we present only those obtained with the BIC indicator.
Regarding the noises (the outliers) in our dataset, we do not check the noise handling option. When this option is checked, outliers are defined as cases placed, during CF tree building, in leaves with fewer records than the specified percentage of the maximum leaf size.
An important option given by SPSS is to export in XML format the CF tree or the entire model. This allows the model to be updated later for additional datasets.
The Auto-Clustering statistics table in SPSS output can be used to assess the optimal number of clusters in our analysis, as shown in Table 2.
Table 2. Auto-Clustering (columns: Number of Clusters; Schwarz's Bayesian Criterion (BIC); Ratio of Distance Measures)
In Table 2, although the lowest BIC coefficient is obtained for seven clusters, according to the SPSS algorithm the optimal number of clusters is three, because the largest ratio of distance measures is obtained for three clusters. The cluster distribution is shown in Table 3.
Table 3. Cluster distribution (columns: % of Combined; % of Total)
SPSS also presents the frequencies for each categorical variable. Table 4 shows the frequencies for the SavingsAccount variable.
Table 4. Frequencies for SavingsAccount variable
Note: * F -Frequencies; % - Percent
The cluster pie chart from Figure 1 shows the relative sizes for our three-cluster solution.
Fig. 1. Cluster size
For categorical variables, the within-cluster percentage plot shows how each variable is split within each cluster. Figure 2 shows the contribution of the variable Property within each of the three clusters. Note that in cluster 1 the predominant property is unknown, while in cluster 2 it is real estate and in cluster 3 a car.
Fig. 2. The weight of Property in each cluster
SPSS gives the importance plot for each variable (categorical or continuous). In Figure 3 we present the importance of the categorical variables for the first two clusters.
Note that Property and Housing contribute the most to differentiating the first cluster and PersonalStatus, CreditHistory, Housing and YearsEmployed differentiate the second.
Fig. 3. Categorical variablewise importance for clusters 1 and 2
Regarding the importance of the continuous variables for cluster 3, we note that cluster 3 is differentiated by the top four variables (the number of credits at the bank, the dependents, the age and the payment rate) in a positive direction and only by ResidenceSince in a negative direction, but the positive variables contribute more to the differentiation of cluster 3.
Fig. 4. Continuous variablewise importance for cluster 3
Based on the results provided by TwoStep Cluster, we draw the following conclusions.
The first cluster, which accounts for 15% of the customers, contains mostly single male customers who occupy management positions (34.5%) or are unemployed (27.3%); they have unknown properties and their loans are approved in a small percentage of cases (11.9%).
Cluster 2, which accounts for 35.1%, contains female or married male customers with real estate (54.6%), mostly unemployed (54.5%) or unskilled (47.5%); the purpose of the loan is appliances, retraining or furniture.
The most important cluster is the third. This is the largest cluster (49.9%), containing mostly single or divorced male customers, with the largest savings accounts, employed for between 4 and 7 years, occupying management positions (54.7%) or being skilled workers (50.6%), with a good credit history; the purpose of the loan is business, cars (new or used) or education; they have their own housing (65.1%) and their loans are approved in a large percentage of cases (55.9%).
Clustering methods can be applied in various fields that use large datasets, in order to find hidden patterns. Since most data taken from the real world (as in the banking field, in our case) contain both numerical and categorical attributes, classical clustering algorithms cannot work efficiently with such data. To solve this problem, we showed that the TwoStep method, which also determines the optimal number of clusters automatically, can easily be used.
Using this method on our data, we identified three customer profiles. The most important profile contains skilled customers with no bad credit history, whose purpose is to obtain a loan for education or for business. The second profile groups a middle class of customers, unemployed but with real estate, whose loans are for retraining or for household goods. The third profile groups persons with unknown properties, mostly unemployed, who want credit for things such as new or used cars or a television, and then for education.
This case study is useful for a bank that intends to consolidate its profits by better managing its current and prospective clients when granting loans.