Customer churn is the business term used to describe the loss of clients or customers. Banks, telecom companies, ISPs, insurance firms, and similar businesses use customer churn analysis and customer churn rate as key business metrics, because the cost of retaining an existing customer is far less than the cost of acquiring a new one. Companies often have dedicated departments that attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.
Customer churn can be categorized into voluntary churn and involuntary churn. In voluntary churn, the customer decides to switch to another service provider, whereas in involuntary churn, the customer leaves the service due to circumstances such as relocation or death.
Businesses usually exclude involuntary churn from churn prediction models and focus on voluntary churn, because it usually stems from the company-customer relationship, over which the company has control.
Churn is usually measured as gross churn and net churn. Gross churn is calculated as the loss of previous customers and the recurring revenue those customers generated. Net churn is measured as gross churn offset by the addition of new, similar customers. In financial systems, this is often measured in terms of Recurring Monthly Revenue (RMR).
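As a rough illustration of the two measures, the sketch below computes gross and net churn on recurring monthly revenue. The function names and the figures are hypothetical, and net churn is read here as gross churn offset by the revenue of newly added customers, which is one common interpretation rather than a definition taken from a particular financial system.

```python
def gross_churn_rate(starting_rmr, lost_rmr):
    """Share of the starting recurring monthly revenue lost to churned customers."""
    return lost_rmr / starting_rmr


def net_churn_rate(starting_rmr, lost_rmr, new_rmr):
    """Gross churn offset by revenue from newly added customers (assumed reading)."""
    return (lost_rmr - new_rmr) / starting_rmr


# Hypothetical monthly figures
starting_rmr = 100_000.0   # RMR at the start of the month
lost_rmr = 8_000.0         # RMR of customers who left
new_rmr = 5_000.0          # RMR of newly added customers

gross = gross_churn_rate(starting_rmr, lost_rmr)
net = net_churn_rate(starting_rmr, lost_rmr, new_rmr)
print(f"gross churn: {gross:.1%}, net churn: {net:.1%}")
```

With these figures, gross churn is 8% of the starting revenue and net churn is 3%.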
Predicting and preventing customer churn is becoming a primary focus of many enterprises. Every enterprise wants to retain each of its customers in order to maximize the profit and revenue earned from them. With the introduction of business and management systems and the automation of operational flows, companies have gathered large amounts of customer and business data during daily operations, which gives data mining techniques a good basis for prediction. Many data mining algorithms and models have emerged to address the problem of customer loss, and they have been widely used in this field for decades.
The study of customer churn can be seen as a classification problem with two classes: churners and non-churners. The goal is to build a classifier from a training dataset to predict potential churn customers. However, the performance of a classifier depends on the distribution of class samples in the training dataset. An imbalanced distribution of class samples is an issue in data mining, as it leads to lower classification performance.
Many algorithms and models have been applied to customer churn prediction. The most common are decision trees, artificial neural networks, and logistic regression. In addition, other algorithms such as Bayesian networks, support vector machines, rough sets, and survival analysis have also been used.
In addition to algorithms and models, other techniques, such as input variable selection, feature selection, and outlier detection, have been applied to get better results from the above algorithms.
The first three models, i.e. decision trees, artificial neural networks, and logistic regression, have been applied in mature form at many companies. Each algorithm has been improved over multiple iterations and is now fairly stable. But as business operations and activities grow, solving the problem of customer churn is becoming an increasingly complex challenge, and this calls for new churn prediction models that are fast and robust and that can be trained and scored quickly on large amounts of data.
Predictive modeling is similar to the human learning experience in using observations to form a model of the important characteristics of some phenomenon. This approach uses generalizations of the 'real world' and the ability to fit new data into a general framework. Predictive modeling can be used to analyze an existing database to determine some essential characteristics (model) about the data set. The model is developed using a supervised learning approach.
This has two phases: training and testing. Training builds a model using a large sample of historical data called a training set, while testing involves trying out the model on new, previously unseen data called a test set, to determine its accuracy and physical performance characteristics. Applications of predictive modeling include customer retention management, credit approval, cross-selling, and direct marketing. Supervised classification is one of the techniques associated with predictive modeling.
In supervised classification:
A training data set is used to generate the class descriptions (predictive models). For each record of the training set, the class to which it belongs is known. These descriptions are then used to classify the unclassified records.
A test data set is used to measure the effectiveness of a classification method. A set of test records whose classifications are already known is passed through the classifier, and the resulting classifications are compared with the known classifications. The percentage of matching classifications is the measure of effectiveness of the classification method.
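The two roles of the data can be sketched in a few lines. This is a minimal example under stated assumptions: the classifier is a deliberately simple one-feature threshold rule (not one of the models surveyed here), the feature (number of complaint calls) is hypothetical, and the records are made up.

```python
def train_threshold(records):
    """Pick the threshold on one feature that classifies the training set best.

    records: list of (feature_value, label), label 1 = churner, 0 = non-churner.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in records}):
        correct = sum((x >= threshold) == bool(y) for x, y in records)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def classify(threshold, x):
    """Predict churner (1) when the feature reaches the learned threshold."""
    return 1 if x >= threshold else 0


def effectiveness(threshold, records):
    """Percentage of records whose predicted class matches the known class."""
    matches = sum(classify(threshold, x) == y for x, y in records)
    return 100.0 * matches / len(records)


# Hypothetical data: (complaint calls, known class)
training_set = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1), (1, 0), (4, 1)]
test_set = [(0, 0), (2, 0), (3, 1), (5, 1), (2, 1)]

threshold = train_threshold(training_set)
print("learned threshold:", threshold)
print("test effectiveness (%):", effectiveness(threshold, test_set))
```

On this toy data the learned threshold is 3 and four of the five test records match, so the measured effectiveness is 80%.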
Customer Churn Prediction Method
In the telecommunication industry, the core of churn prediction is to construct effective models that can identify customers with high churn probabilities. These probabilities serve as the first step in customer retention.
These churn prediction models apply mathematical data mining techniques to customer characteristics such as basic customer information, billing details, call detail records, call center records, and service usage patterns to predict customer churn probabilities.
The telecommunication industry employs data mining techniques to construct prediction models. The process can be divided into the following five phases.
Investigation and Sampling
Each telecom operator has a different definition of customer churn, and data layouts and storage also differ between operators. So a comprehensive investigation is required before the modeling process starts, to obtain clear definitions of churn customers and positive customers based on the operator's practical work. Modelers should also understand the available data in detail, so that they can propose a feasible and practical data sampling solution. The data sampling solution should specify the sample size, the list of attributes, the time scale, and the proportion of churn customers to positive customers.
Data understanding and data preparation
This usually comprises four steps: a) determining and collecting the initial data, b) understanding and describing the data that has been collected, c) conducting initial data profiling and checking data quality, and d) creating new derived variables on top of the initial data, selecting variables for modeling, and cleansing and transforming the data for better results.
Customer churn models rest on two basic hypotheses:
Churn tendencies are different in different customer segments.
Customers that are likely to churn will have an unusual behavior.
Customer churn models are used to calculate the probability of a customer churning, P(X), where X is the information about the customer stored in the database, such as calling patterns, VAS usage patterns, and demographic information. P(X) takes values within [0,1]. Commonly used data mining algorithms for customer churn prediction are multiple regression, decision trees, neural networks, and sometimes a combination of all three.
Prediction models do not normally provide a hard classification of customers. Rather than a yes-or-no conclusion on whether a customer will churn, models return a probability from 0 to 1 that a customer belongs to a specific class. Practitioners usually take all constraints into consideration and then set a churn probability threshold to classify customers accordingly. Through prediction, a company can focus on the small number of customers that the model has marked as likely to churn, rather than targeting everyone with a mass campaign.
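Turning probabilities into a target list can be sketched as follows. The customer ids, the probabilities, and the 0.6 threshold are all hypothetical; in practice the threshold would be chosen from business constraints such as campaign budget.

```python
def flag_churners(probabilities, threshold):
    """Return the ids of customers whose churn probability reaches the threshold."""
    return [cid for cid, p in probabilities.items() if p >= threshold]


# Hypothetical model output: customer id -> P(X), each within [0, 1]
scores = {"c1": 0.91, "c2": 0.15, "c3": 0.62, "c4": 0.40}

targets = flag_churners(scores, threshold=0.6)
print(targets)  # ['c1', 'c3']
```

Only the two flagged customers would be contacted, instead of all four.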
These customer churn prediction models also provide, along with the probability, an explanation of why customers are leaving the organization. Prediction algorithms reveal the regular patterns that indicate why customers are churning, and which patterns have a higher impact on customer loyalty and customer churn.
The model is evaluated via response rate, captured rate, lift value, ROC value, and revenue curve. Response rate indicates the percentage of real churners among the predicted churners. Lift value indicates the ratio of real churned customers within the top selected predicted probabilities of churn.
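A small sketch of three of these measures, computed on the k customers with the highest predicted probabilities. The formulas follow commonly used definitions (response rate as the churn rate inside the selection, captured rate as the share of all churners that the selection contains, lift as response rate divided by the overall churn rate); these are assumptions and may differ in detail from the definitions used by a given study. The scores and labels are made up.

```python
def top_k_metrics(scores, labels, k):
    """Response rate, captured rate and lift for the k highest-scored customers."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])   # real churners in the top k
    total_churners = sum(labels)
    response_rate = hits / k
    captured_rate = hits / total_churners
    base_rate = total_churners / len(labels)       # churn rate with no model
    lift = response_rate / base_rate
    return response_rate, captured_rate, lift


# Hypothetical predicted probabilities and actual outcomes (1 = churned)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

response_rate, captured_rate, lift = top_k_metrics(scores, labels, k=2)
print(f"response={response_rate:.2f} captured={captured_rate:.2f} lift={lift:.2f}")
```

Here the top two customers are both churners, so the selection contains two of the three churners and is a little over three times as dense in churners as the customer base as a whole.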
Lastly, based on the results of model evaluation, the model is adjusted if required. If the model produces satisfactory results, no adjustment is needed; otherwise the model should be adjusted by selecting a different data mining model or different input variables.
Input Variable Selection
A. Significance of variable selection
In churn prediction models, the number of variables is very large, and the variables are key to getting information out of the customer data. However, some variables may be redundant, noisy, or less informative for prediction. Using all variables as inputs to train churn prediction models burdens the training process; furthermore, some variables may have a negative impact on the predictive ability of the models. Hence, variable selection is an important step.
B. Principles of variable selection
As mentioned above, variable creation aims to extract as much potential customer churn information as possible. This information (a large number of variables) therefore unavoidably contains a lot of superfluous information. Input variable selection achieves both data cleaning and data reduction by selecting important features and omitting redundant, noisy, or less informative ones (L. Yan et al., 2003). So it is necessary to extract the useful, concise variables from among so many.
The classifying abilities of the variables should be high
Classifying ability here means the ability of a variable to distinguish churn customers from positive customers. Many researchers have paid attention to this problem. Meyer-Base and Watzel suggested that neural networks can be used for feature selection (Meyer-Base and Watzel, 1998). Ng and Liu performed feature selection by running an induction algorithm on the dataset (Ng K, Liu H, 2001). A method suggested by Datta et al. involves initially finding a subset of features from the data warehouse by manually selecting those that appear most suitable for the task (Datta et al., 2001).
Jiayin Qi and Yuanquan Li used AUC, as proposed by Yan et al. (Yan L, Wolniewicz R, 2004), to measure predictive ability in their study.
AUC (Area Under ROC Curve) Method
AUC is the area between an ROC (Receiver Operating Characteristic) curve and the X axis. X(Po) is the abscissa, while Y(Po) is the ordinate. According to the definition of the ROC curve, Y(Po) is the sensitivity of the model for a given probability cutoff point Po. The sensitivity is a measure of accuracy for predicting target A; it equals the number of records correctly predicted as target A divided by the total number of actual target A under cutoff point Po.
X(Po) is the number of records incorrectly predicted as target A for a given probability cutoff point Po divided by the number of non-A.
The ROC curve is a graphical display that gives an intuitive measure of classifying accuracy. If the ROC curve has a steep upward trend in the lower left corner, the prediction model has high sensitivity even with strict selection criteria, and thus has high accuracy. The closer the ROC curve is to the upper left, the higher the classifying accuracy of the model. The area under the ROC curve, AUC, is a frequently used index to evaluate the classifying accuracy of models.
In practical applications, a variable is considered useful for churn prediction only when its AUC is larger than 0.5. Two variables are both kept only when the mutual information between them is smaller than 0.5. Together, these two cutoffs determine the number of variables used for building prediction models. The selected variables can be used as inputs for initial churn prediction models. If the performance of the models is satisfactory, these can be the final variables for prediction; otherwise the cutoff values of AUC and mutual information should be adjusted further to get better performance.
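The AUC of a single variable, or of a model score, can be computed directly from the definitions of X(Po) and Y(Po). The sketch below sweeps the cutoff over the observed values, collects the ROC points, and integrates with the trapezoidal rule; the function names and the data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (X(Po), Y(Po)) pairs, one per cutoff Po, starting from (0, 0).

    X is the share of non-A records wrongly predicted as A,
    Y is the sensitivity (share of actual A records predicted as A).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(scores, labels):
    """Area between the ROC curve and the X axis (trapezoidal rule)."""
    points = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical variable values and churn labels (1 = target A, i.e. churner)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

value = auc(scores, labels)
print(round(value, 4), "useful" if value > 0.5 else "not useful")
```

For this data the area is 8/9, well above the 0.5 cutoff, so the variable would be kept as a candidate.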
Mutual information between variables should be relatively low
Mutual information is a concept to measure how much one variable tells about another one. A large mutual information between two variables means the two variables share similar information. In this case, we can deselect some redundant variables, to make sure the mutual information between variables is relatively low.
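For discrete variables, mutual information can be estimated from value counts. The sketch below is a minimal version that measures it in bits; the variable names and data are hypothetical, and a continuous variable would have to be binned first.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete variables."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical binary variables
has_complaint = [0, 0, 1, 1]
called_support = [0, 0, 1, 1]   # carries the same information as has_complaint
is_weekend_user = [0, 1, 0, 1]  # unrelated to has_complaint

print(mutual_information(has_complaint, called_support))   # 1.0
print(mutual_information(has_complaint, is_weekend_user))  # 0.0
```

The first pair shares a full bit of information, so one of the two variables is redundant; the second pair shares none.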
C. Variable Selection Process
Jiayin Qi and Yuanquan Li proposed five steps to complete the variable selection process:
For each variable, compute the predictive ability, which is equal to its AUC value.
Select the top N variables with the highest predictive ability. Usually variables whose AUC value is greater than 0.5 are selected.
Compute the mutual information of every pair of variables selected in the previous step.
Select M variables based on the principle of low mutual information. If two variables have high mutual information, the variable with the higher AUC is selected. Typically, variables whose mutual information is smaller than 0.4 are selected. Then keep only one variable for each type of customer characteristic (such as customer basic information, customer service information, customer calling information, and customer billing information). This variable should be the one whose AUC is the highest among all the variables of that type and whose mutual information with other variables is relatively low.
Use the input variable set selected in the previous step to complete customer churn prediction. If the evaluation of the prediction is good, this set can be used as the final input variables for the prediction model, and the variable selection process is over. If the evaluation is not good, the variable combinations with the highest AUC in the dropped set should each be added to the final set in turn, and the prediction results tested individually. The new set corresponding to the best prediction result becomes the input variables of the prediction model. If there is still no acceptable prediction result after adding any variable combination from the dropped set, the prediction model should be rebuilt.
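The first four steps can be sketched as a small filter. This is an illustrative reading of the procedure, not the authors' implementation: it assumes discrete variables, uses a greedy pass in descending AUC order to resolve redundant pairs, omits the per-category rule and the final evaluation loop of the fifth step, and uses made-up data and variable names. The 0.5 and 0.4 cutoffs are the typical values quoted in the steps.

```python
from collections import Counter
from math import log2


def auc(values, labels):
    """AUC by pair counting: chance that a churner scores above a non-churner."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete variables."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (cx[x] * cy[y])) for (x, y), c in cxy.items())


def select_variables(data, labels, auc_cutoff=0.5, mi_cutoff=0.4):
    # Steps 1-2: score every variable by AUC and keep those above the cutoff.
    scored = {name: auc(column, labels) for name, column in data.items()}
    ranked = sorted((name for name in scored if scored[name] > auc_cutoff),
                    key=lambda name: scored[name], reverse=True)
    # Steps 3-4: walk down the ranking and drop any variable that shares too
    # much information with an already selected (higher-AUC) variable.
    selected = []
    for name in ranked:
        if all(mutual_information(data[name], data[kept]) < mi_cutoff
               for kept in selected):
            selected.append(name)
    return selected


# Hypothetical customer data: eight customers, four candidate variables
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = {
    "complaints":      [3, 3, 2, 1, 0, 0, 1, 2],
    "complaints_flag": [1, 1, 1, 0, 0, 0, 0, 1],  # derived from complaints
    "night_calls":     [1, 0, 1, 1, 0, 1, 0, 0],
    "region":          [0, 1, 0, 0, 1, 1, 0, 1],
}

print(select_variables(data, labels))  # ['complaints', 'night_calls']
```

In this toy run, region is rejected by the AUC cutoff, and complaints_flag is rejected because it duplicates information already carried by complaints.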
New Feature Selection Approach
Huang and Kechadi proposed a new feature selection technique for churn prediction models. Their primary focus was the telecommunication industry, where the number of input variables (features) is very large, so it is better to select a subset of features that have the most ability to distinguish the target classes. Otherwise, running an algorithm on all the input variables consumes too much time and too many resources. The most commonly used feature selection techniques only judge whether an input feature is helpful for classifying the classes or not. The approach they proposed takes into account the relationship between a specific categorical value of the feature and a class when selecting or removing the feature.
Huang and Kechadi's idea of taking categorical values into account during feature selection is sound. But it is limited to categorical values, and continuous values cannot be used directly. Continuous values need to be discretized into categorical values before their feature selection approach can be applied, and this conversion from continuous to discrete may result in loss of information.
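The loss of information from discretization is easy to see in a small example. The sketch below uses equal-width binning, which is only one of several possible discretization schemes and is chosen here purely for illustration; the function name and the call-minute values are hypothetical, and the input is assumed to contain at least two distinct values.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins   # assumes hi > lo
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical monthly call minutes for six customers
minutes = [12.0, 15.5, 48.0, 51.0, 97.5, 100.0]

print(equal_width_bins(minutes, 3))  # [0, 0, 1, 1, 2, 2]
```

After binning, customers with 12.0 and 15.5 minutes fall into the same category and can no longer be told apart, which is the kind of information loss the conversion introduces.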
Luo, Shao and Liu studied the decision tree as a predictive model that is used to make predictions through a classification process. The predictive model is represented as an upside-down tree, with the root at the top (or on the left-hand side) and the leaves at the bottom (or on the right-hand side).
Decision Trees represent rules. By following the Tree, you can decipher the rules and understand why a record is classified in a certain way. These rules can then be used to retrieve records falling into a certain category, and the known behavior of the category is the predicted behavior of the entity represented by the record.
In CRM, Decision Trees can be used to classify existing customer records into customer segments that behave in a particular manner. The process starts with data related to customers whose behavior is already known; for example, customers who have responded to a promotional campaign and those who have not; or customers who have churned (left the service for a competitor) and those who have not. The Decision Tree developed from this data gives us the splitting attributes and criterion that divide customers into two categories. Once the rules that determine the classes to which different customers belong are known, they can be used to classify existing customers and predict behavior in future. For example, a customer whose record shows attributes similar to those customers who have churned in the recent past is more likely to churn, and that is the prediction that marketers are looking for to plan activities to pre-empt the churn.
Luo, Shao and Liu studied the effect of data selection methods, i.e. under-sampling, over-sampling and random sampling, on the efficiency and correctness of the predicted values, and found that random sampling provides the best results. They also studied sub-periods of 10, 20, 30 and 60 days while developing different models based on 180 days of data, and found that the shorter the sub-period, the more efficient the model. The third parameter they studied was the acceptable misclassification rate, and they proposed 1:5 as the best-suited misclassification rate.
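The three data selection methods can be sketched as follows. This is a generic illustration of under-sampling, over-sampling and random sampling on an imbalanced set, not the procedure used in the study; the record layout, the 10% churn rate and the fixed seed are assumptions.

```python
import random


def split_by_class(records):
    """Return (minority, majority) lists of records."""
    churners = [r for r in records if r["churn"] == 1]
    others = [r for r in records if r["churn"] == 0]
    return sorted([churners, others], key=len)


def under_sample(records, rng):
    """Keep every minority record and an equal number of majority records."""
    minority, majority = split_by_class(records)
    return minority + rng.sample(majority, len(minority))


def over_sample(records, rng):
    """Keep everything and repeat minority records until the classes balance."""
    minority, majority = split_by_class(records)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


def random_sample(records, size, rng):
    """Draw records at random, leaving the class proportions to chance."""
    return rng.sample(records, size)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [{"id": i, "churn": 1 if i < 10 else 0} for i in range(100)]

under = under_sample(records, rng)
over = over_sample(records, rng)
print(len(under), sum(r["churn"] for r in under))  # 20 10
print(len(over), sum(r["churn"] for r in over))    # 180 90
```

Under-sampling shrinks the set to 20 balanced records, while over-sampling grows it to 180 balanced records by repeating churners.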
A record enters the decision tree at the root node. At the root, a test is applied to determine which child node the record will encounter next.
Splitting attribute: Associated with every node of the Decision Tree is an attribute, called the splitting attribute, whose values determine the partitioning of the data set when the node is expanded.
Splitting criterion: The qualifying condition on the splitting attribute is called the splitting criterion. For a numeric attribute, the criterion can be an equation or an inequality. For a categorical attribute, it is a membership condition on a subset of values.
This process is repeated until the record arrives at a leaf node. All the records that end up at a given leaf of the Tree are classified in the same way. There is a unique path from the root to each leaf. The path is a rule, which is used to classify the records.
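The traversal described above can be sketched with a small hand-built tree. The tree, its attributes, and its thresholds are hypothetical and are not learned from data; the point is only to show a record moving from the root to a leaf and the path being read back as a rule.

```python
# A node is either a leaf {"label": ...} or an internal node holding a
# splitting attribute, a splitting criterion, and two children.
tree = {
    "attribute": "monthly_complaints",
    "criterion": lambda v: v >= 3,                        # inequality, numeric
    "description": "monthly_complaints >= 3",
    "yes": {"label": "churner"},
    "no": {
        "attribute": "contract",
        "criterion": lambda v: v in {"prepaid", "monthly"},  # membership, categorical
        "description": "contract in {prepaid, monthly}",
        "yes": {"label": "churner"},
        "no": {"label": "non-churner"},
    },
}


def classify(node, record):
    """Walk from the root to a leaf; return the class and the rule that was followed."""
    path = []
    while "label" not in node:
        passed = node["criterion"](record[node["attribute"]])
        step = node["description"]
        path.append(step if passed else "not (" + step + ")")
        node = node["yes"] if passed else node["no"]
    return node["label"], " and ".join(path)


record = {"monthly_complaints": 1, "contract": "prepaid"}
label, rule = classify(tree, record)
print(label)  # churner
print(rule)   # not (monthly_complaints >= 3) and contract in {prepaid, monthly}
```

Every record that satisfies the same rule ends at the same leaf and receives the same class.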
Advantages of Decision Tree
Decision trees are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
Data preparation for a decision tree is basic or unnecessary. Other techniques often require data normalization, creation of dummy variables, and removal of blank values.
Decision trees are able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. For example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables.
A decision tree is a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic. An example of a black box model is an artificial neural network, since the explanation for the results is too complex to be comprehended.
It is possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
Decision trees are robust and perform well with large data in a short time. Large amounts of data can be analyzed using personal computers in a time short enough to enable stakeholders to take decisions based on the analysis.
Decision trees map nicely to a set of business rules and can be easily applied to real problems.
Decision tree algorithms make no prior assumptions about the data.
Limitations of Decision Tree
The output attribute for a decision tree algorithm must be categorical.
It is limited to only one output attribute.
Decision tree algorithms are unstable: even small changes in the input data can cause large changes in the decision tree. A change in a variable, the exclusion of duplicates, or an alteration in variable sequence might require redrawing the whole tree. Decisions in decision trees are based on expectations, and irrational expectations might lead to errors.
Creating decision trees from numeric datasets can be complex, and creating large decision trees with many branches is also complex and time-consuming.
A Bayesian network is a kind of graphical model used to show the joint probability among different variables. This model provides a natural way to describe causality information, which can be used to discover potential relations in data. The concept of Bayesian networks was first proposed by Judea Pearl (1986), who systematically elaborated the related concepts and principles. With the development of artificial intelligence, Bayesian networks have been successively used in knowledge representation for expert systems, data mining, and machine learning. In recent years, the study and application of Bayesian networks have come to cover most fields of artificial intelligence, including causal reasoning, uncertain knowledge representation, pattern recognition, and cluster analysis.
A Bayesian network consists of nodes representing attributes, connected by lines. Because more than one attribute may determine another, the model involves the theory of multivariate probability distributions. Besides, since different Bayesian networks have different structures, and concepts from graph theory such as tree, graph, and directed acyclic graph can describe these structures clearly, graph theory is an important theoretical foundation of Bayesian networks alongside probability theory.
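A tiny network makes the idea of a joint probability built from per-node tables concrete. In the sketch below two attributes (a complaint and a short contract) jointly determine a third (churn); the structure and every probability are invented for illustration and are not estimated from any data.

```python
# Hypothetical network: complaint -> churn <- short_contract
p_complaint = {True: 0.2, False: 0.8}
p_short_contract = {True: 0.3, False: 0.7}
# P(churn = True | complaint, short_contract)
p_churn_given = {
    (True, True): 0.7,
    (True, False): 0.4,
    (False, True): 0.2,
    (False, False): 0.05,
}


def joint(complaint, short_contract, churn):
    """Joint probability as the product of each node's table entry given its parents."""
    p_c = p_churn_given[(complaint, short_contract)]
    p_churn = p_c if churn else 1.0 - p_c
    return p_complaint[complaint] * p_short_contract[short_contract] * p_churn


def prob_churn():
    """Marginal churn probability: sum the joint over the parent attributes."""
    return sum(joint(c, s, True) for c in (True, False) for s in (True, False))


print(round(joint(True, True, True), 3))  # 0.042
print(round(prob_churn(), 3))             # 0.174
```

The directed acyclic structure is what allows the joint distribution to factor into these small tables.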
Bayesian network has the ability to process incomplete datasets.
Bayesian network has the ability to study causality.
Bayesian network has the ability to consider prior knowledge.
Bayesian network has the ability to effectively prevent overfitting.
If the dataset is large, the structure learning of the Bayesian networks will be too difficult.
Jiayin, Yangming, Yingying and Shuang proposed a new algorithm for churn prediction and called it TreeLogit. This algorithm is a combination of the ADTree and logistic regression models.
ADTree is a newer decision tree algorithm that gains the benefit of boosting in terms of classification accuracy. It is more interpretable than the traditional decision tree algorithm. Logistic regression is a standard statistical method, which requires that the model builder has prior knowledge of the object of analysis and develops simplifying assumptions about the relationships inside or outside the object. The model is then constructed based on these assumptions.
TreeLogit incorporates the advantages of both algorithms, making it comparable to the TreeNet® model, which won the best prize in a 2003 customer churn prediction contest. Because TreeLogit combines the advantages of both base algorithms, it becomes a very powerful tool for customer churn prediction.
The modeling process of TreeLogit starts by designing the customer's characteristic variables based on prior knowledge. The characteristic variables are then categorized into m sub-vectors, and a decision tree is created for each sub-vector. Once the decision tree for each sub-vector exists, a logistic regression model is developed for each sub-vector. Finally, the accuracy and interpretability of the model are evaluated. If they are acceptable, the customer retention process is started; otherwise the model is re-tuned for better results.
According to Jiayin, Yangming, Yingying and Shuang, because TreeLogit combines the advantages of both ADTree and logistic regression, it is both data-driven and assumption-driven, and it is capable of analyzing objects with incomplete information. Moreover, its efficiency is not affected by poor-quality data, and it generates continuous output with relatively low complexity.
Advantages of ADTree | Advantages of Logistic Regression
Data-driven mode, highly automated analysis process | Assumption-driven, integrated with analyst's prior knowledge
Capable of analyzing objects with incomplete information | Requires complete information about the object of analysis
Not affected by dirty data | Sensitive to isolated points
Rough and discrete predicted values | Smooth and continuous output
High degree of complexity | Low degree of complexity
Support Vector Machines
Jing and Xinghua, in their work on customer churn prediction, presented a model based on support vector machines. Support vector machines are developed on the basis of statistical learning theory, which is regarded as the best theory for small-sample estimation and predictive learning. Studies on machine learning with finite samples were started by Vapnik in the 1960s, and a relatively complete theoretical system called statistical learning theory was established in the 1990s. After that, the support vector machine (SVM), a new learning machine, was proposed. SVM is built on the structural risk minimization principle, which aims to minimize the real error probability, and is mainly used to solve pattern recognition problems. Because of SVM's complete theoretical framework and good results in practical application, it has been widely valued in the machine learning field.
The algorithm is suited to finite samples.
The algorithm is able to find the global optimum rather than a local extremum.
The dimension of the samples does not affect the algorithm's complexity.
There are some difficulties in the theory.
SVM has many types, and it is not easy to choose a fitting one.
Rough set is a data analysis theory proposed by Z. Pawlak. Its main idea is to derive decision or classification rules by knowledge reduction, on the premise that the classification ability remains unchanged. The theory has some unique views, such as knowledge granularity, which make it especially suitable for data analysis. Rough set theory is built on a classification mechanism, and the partition of the space induced by an equivalence relation is regarded as knowledge. Generally speaking, it describes imprecise or uncertain knowledge using knowledge that is already established. In this theory, knowledge is regarded as a kind of classification ability on data, and the objects in the universe are usually described by a decision table, a two-dimensional table in which each row represents an object and each column an attribute. The attributes consist of decision attributes and condition attributes. The objects in the universe can be assigned to decision classes with different decision attribute values according to their condition attributes. One of the core ideas in rough set theory is reduction, a process in which unimportant or irrelevant knowledge is deleted on the premise that the classification ability remains unchanged. A decision table may have several reductions, whose intersection is defined as the core of the decision table. The attributes in the core are important because of their effect on classification.
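The notions of decision table, equivalence classes, and reduction can be sketched on a toy example. The code below assumes a consistent decision table (no two identical objects with different decisions) and checks, one condition attribute at a time, whether dropping it preserves the classification ability; the attributes that cannot be dropped form the core. The table, attribute names, and helper names are hypothetical.

```python
def partition(table, attributes):
    """Group row indices into equivalence classes on the given attributes."""
    classes = {}
    for i, row in enumerate(table):
        classes.setdefault(tuple(row[a] for a in attributes), []).append(i)
    return list(classes.values())


def preserves_classification(table, attributes, decision):
    """True if every equivalence class maps to a single decision value."""
    return all(len({table[i][decision] for i in block}) == 1
               for block in partition(table, attributes))


def core(table, conditions, decision):
    """Condition attributes whose removal would break the classification."""
    return [a for a in conditions
            if not preserves_classification(
                table, [c for c in conditions if c != a], decision)]


# Hypothetical decision table: one row per customer, one column per attribute
table = [
    {"usage": "low",  "complaints": "many", "plan": "A", "churn": "yes"},
    {"usage": "low",  "complaints": "few",  "plan": "A", "churn": "no"},
    {"usage": "high", "complaints": "many", "plan": "B", "churn": "yes"},
    {"usage": "high", "complaints": "few",  "plan": "B", "churn": "no"},
]
conditions = ["usage", "complaints", "plan"]

print(core(table, conditions, "churn"))                            # ['complaints']
print(preserves_classification(table, ["complaints"], "churn"))    # True
```

In this table complaints alone is enough to classify every customer, so usage and plan can be removed without changing the classification ability.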
No preparatory or additional information is necessary.
Data noise is easy to remove.
It has a good ability for knowledge reduction and is complementary with other models.
Only data that has been discretized can be used.
Survival analysis is a statistical analysis method used to analyze and infer the life expectancy of creatures or products from data that comes from surveys or experiments. It combines the outcomes of events with the corresponding time spans. It was initially used in medical science to study the influence of medicines on the life expectancy of research subjects. Survival time should be understood broadly, that is, as the duration of some condition in nature, society, or a technical process. In this paper, the churn of a customer is regarded as the end of the customer's survival time. In the 1950s, statisticians began to study the reliability of industrial products, which advanced the development of survival analysis in theory and application. The proportional hazards regression model, first proposed by Cox in 1972, is a commonly used survival analysis technique.
It has unique advantages in dealing with time-series data.
Data with a long time span and good time continuity is required.
Genetic K-Means Algorithm
The methods to solve the customer churn problem can be divided into two categories: 1. data sampling, 2. cost-sensitive learning.