# Classification Techniques For Credit Scoring And Data Mining Accounting Essay

This section applies two classification techniques, logistic regression and a decision tree, to a telecom data set. It is divided into three parts: preparing the data set, analysing it with logistic regression and then with a decision tree, and comparing the two methods.

## Preparing data

Because the data set is complicated, Microsoft Excel is used to clean the data. The first step is to detect outliers, which are divided into valid and invalid outliers. The outliers are then treated by deleting, replacing or keeping them. Some columns, such as newCellIndN and negTrend, are deleted because they carry the same meaning as newCellIndY and posTrend respectively. In addition, a date of birth is assumed to be no earlier than 1920; otherwise the record is deleted. The average peak minutes and the average plan should be integers greater than zero. Next, the range of the data is restricted by using z-scores, so that values lying outside the first and third quartiles are deleted.
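The cleaning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the essay performed these steps in Excel, the field names `birth_year`, `av_peak_minutes` and `av_plan` are hypothetical, and the z-score threshold is a free parameter that the essay does not specify.

```python
from statistics import mean, pstdev


def is_valid_record(rec):
    """Apply the validity rules described above to one record (a dict).

    Field names are hypothetical stand-ins for the telecom columns.
    """
    return (
        rec["birth_year"] >= 1920
        and isinstance(rec["av_peak_minutes"], int) and rec["av_peak_minutes"] > 0
        and isinstance(rec["av_plan"], int) and rec["av_plan"] > 0
    )


def zscore_filter(values, threshold=3.0):
    """Keep only values whose absolute z-score is at most `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical, so nothing can be an outlier
        return list(values)
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


# Made-up peak-minute values with one extreme observation.
peak_minutes = [120, 135, 128, 140, 131, 125, 138, 129, 133, 5000]
cleaned = zscore_filter(peak_minutes, threshold=2.0)
print(cleaned)

print(is_valid_record({"birth_year": 1910, "av_peak_minutes": 120, "av_plan": 2}))
```

The threshold of 2.0 is chosen only so that the small example removes its single extreme value; a real data set would need a threshold justified by its own distribution.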

Weka automatically splits the data set into a training set (66.67%) and a test set (33.33%). The training set is used to build classifiers with logistic regression and a decision tree, and the test set is then used to check the performance of each classifier.
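A percentage split of this kind can be illustrated with a minimal sketch. It does not reproduce Weka's own implementation; the function name, the fixed seed and the placeholder records are assumptions made for the example.

```python
import random


def split_train_test(rows, train_fraction=2 / 3, seed=42):
    """Shuffle the rows and split them into a training and a test part."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(300))  # placeholder for 300 customer records
train, test = split_train_test(records)
print(len(train), len(test))
```

Shuffling before cutting matters when the file is sorted, for example by contract date, because an unshuffled split would then train and test on different kinds of customers.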

## Logistic Regression

Logistic regression is the first classification method applied. Weka automatically generates the confusion matrix, classification accuracy, sensitivity and specificity for the training and test sets at a cut-off of 0.5. The results are summarised in Table 1 and Figure 1.

| Measure | Training set | Test set |
| --- | --- | --- |
| Classification accuracy | 64.12% | 65.26% |
| Sensitivity | 0.647 | 0.665 |
| Specificity | 0.632 | 0.626 |

Table 1- Summarising the results of training set and test set from Weka

Figure 1- The results of training set and test set from Weka

The inputs that influence the prediction are newCellIndY, birthDate, svcStartDt, incomeCode, peakMinDiff, posTrend, nrProm, prom, avPlan, posPlanChange and negPlanChange. Both positive and negative relationships are present. The most predictive input is negPlanChange because it has the largest coefficient in the logistic regression.

Weka can also generate the ROC curve and calculate the AUC of the test set, which is 0.675. The corresponding accuracy ratio is 0.350.

Figure 2- ROC curve of test set from Weka

## Decision Tree

A decision tree is another method used to classify the observations in the data set. To guard against overfitting, the test set is used to evaluate the model built from the training set. In addition, a validation set is normally used to limit the size of the decision tree because it is independent of the training set: the tree stops growing at the point where performance on the validation set no longer improves, which gives the optimal stopping point. However, the software has no function for a separate validation set, so the training set is also used as the validation set. Finally, the tree is built with the C4.5 algorithm as the statistical classifier.

Figure 3- The decision tree from Weka

Weka reports the same measures for the decision tree as for logistic regression. The results are presented in Table 2 and Figure 4.

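| Measure | Training set | Test set |
| --- | --- | --- |
| Classification accuracy | 76.85% | 77.42% |
| Sensitivity | 0.747 | 0.808 |
| Specificity | 0.804 | 0.730 |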

Table 2- Summarising the results of training set and test set from Weka

Figure 4- The results of training set and test set from Weka

From the above results, the AUC of the test set is 0.85 and the accuracy ratio is 0.7.

Figure 5- ROC curve of test set from Weka

## Comparison

In conclusion, the logistic regression results show that several inputs are related to the prediction: new user, age, date of contract, income, average peak minutes, positive usage trend, the number of promotions sent, average plan, plan upgrade and plan downgrade. The other variables show no relationship with the prediction of defection. The most predictive variable is the plan downgrade: customers who have downgraded their plan tend to defect from the company.

Regarding the results of both techniques, the decision tree performs better than logistic regression on the test set. Because both use the same data set, they can be compared on the AUC for the test set, which measures the performance of each technique. The AUC of the decision tree is 0.85, which is higher than the 0.691 for logistic regression. Likewise, the accuracy ratio of the decision tree is 0.7, while the value for logistic regression is 0.382, which is quite low.

## Part 2

This section discusses the main points of a journal paper in which the concepts of data mining and credit scoring are applied to a real-life problem.

## Journal

The paper is "A Data Mining Approach for Identifying Predictors of Student Retention from Sophomore to Junior Year" by Yu et al. The full citation is: Yu, C.H., DiGangi, S., Jannasch-Pennell, A. and Kaprolet, C. (2010). A Data Mining Approach for Identifying Predictors of Student Retention from Sophomore to Junior Year. Journal of Data Science, 8(2), 307-325.

## Data Mining Problem

According to the National Center for Public Policy and Higher Education, although 73.6 percent of students persist to the second year, only 39.4 percent of such students go on to graduate. These figures illustrate the retention problem facing many universities in the United States.

Student retention matters to all academic institutions because it is seen as representative of the institution itself. Many universities, including Arizona State University (ASU), therefore try to investigate the factors that could improve the retention rate. In this case, researchers from ASU showed how data mining techniques can be used to analyse the factors affecting the number of sophomore students who continue to the third year.

The research used data mining rather than classical statistics for several reasons. Firstly, Shmueli et al. (2007) state that data mining was developed for large data sets and suits multiple types of data, such as discrete, ordinal and interval scales. Furthermore, it involves cross-validation, which avoids overfitting to a single data set. In other words, the results of the study can generalise to other data sets. The data set was therefore divided into training and testing sets, so that the model could be revised to prevent overfitting. In particular, multivariate adaptive regression splines (MARS) were used to balance the local model and the global model. Moreover, this technique can handle outliers and missing data.

## Data Mining Techniques

The data set of this study recorded the enrollment or withdrawal of 6,690 sophomore students at ASU who were due to continue into the third year during the 2003/2004 academic year. It contained fifteen potential predictors, any of which might influence the retention rate. Three data mining techniques were used, namely classification trees, MARS and neural networks. In each case, the training set was used to generate the model and the outcome was then evaluated on the test set until there was no further improvement in the prediction. Each technique is described below.

To begin with, classification trees are a type of decision tree that aims to categorise observations according to the predictors. In this study, logistic regression was compared with the classification trees in order to check the accuracy of the prediction.

Next, MARS is designed to solve regression problems without assuming a particular relationship between the independent and dependent variables. This suits a complex problem in which the predictor variables are not steadily related to the retention rate.

Another technique is the neural network, which consists of input, hidden and output layers. It was used to evaluate whether the predictor variables influence the retention rate. It was chosen in this investigation in order to determine the non-linear relationship between the probability of retention and the predictor variables suggested by the classification tree and MARS, and to add some other variables.

Finally, a special tool was used to consider whether the physical location of students is related to retention. This tool is the Geographic Information System (GIS) in the SAS program.

## Results

The classification tree suggested three crucial variables for predicting retention: transferred credit hours, residency and ethnicity. The tree has three levels, and each level divides the data on one of these factors. Logistic regression gave a different result, with only one predictor, transferred credit hours, affecting the retention rate. The relationship appeared as a negative slope in the graph. To sum up, the classification tree seems likely to be more accurate than the logistic regression because it eliminates outliers.

Regarding the result of MARS, five variables were considered in the process. However, only two of them, transferred credit hours and ethnic group, were used directly because the others had missing values. When the prediction was tested, the overall success rate was 73.53%. The percentage of successful retention predictions was 67.4%, while the proportion for non-retention was 76.95%. Similarly, the sensitivity was 0.77 and the specificity was 0.67.

The neural network used three hidden layers, three tours and five-fold cross-validation. It was generated by an application program to verify the results of the classification tree and MARS, and it led to the following results. The retention rate differed clearly across ethnic groups and across numbers of transferred hours. Likewise, both residency and transferred hours influenced the probability of retention. Beyond the variables suggested by the above techniques, high school rank, American College Test (ACT) z-score, Scholastic Aptitude Test (SAT) z-score and the university math placement test were also considered at this stage. The university math placement test had the greatest influence on the retention rate.

Finally, the retention rate of Arizona residents is higher than that of non-residents, at 0.67 and 0.33 respectively. Testing this variable suggests that Arizona residency is associated with a high retention rate.

## Critical Discussion

The university should focus on developing the quality of its internal tests, such as the university math placement test, because the neural network shows that this test has the greatest effect on retention. It can also serve as a tool for evaluating the performance of students before they begin studying at the university.

Because the data set was retrieved from a single ASU database, the results of this study may not suit other universities. Also, Mortensen (2005) suggests that the retention rate of private universities is usually higher than that of public ones. Therefore, the method can only be applied to other public institutions.

## FICO credit score

When a customer wants to borrow money from a bank, the bank needs to know the risk of the loan. For example, if a customer needs to borrow money from RBC bank, the bank will ask Lloyds, HSBC or other banks for information about the customer in order to evaluate whether it should lend the money and to decide the loan terms based on the customer's credit score. The FICO credit score is the most widely recognised tool that lenders use to determine customers' risk. There are FICO scores based on information from different credit bureaus, such as Experian, Equifax and CallCredit in the United Kingdom (Achou and Tenguh, 2008). The credit of borrowers depends on the historical information that the three credit bureaus have collected, and the scores range from 300 to 850. Customers with higher credit scores are deemed to be low risk, which typically results in them receiving the lowest interest rates. Moreover, the credit reporting agencies maintain millions of individual reports containing personal information, accounts, inquiries from lenders and negative items (myFICO, 2009). For instance, a late payment reduces the FICO score. The FICO score is generated from such credit reports at a point in time, so the score may change over time. In other words, if the raw data presented by the credit reporting agencies changes, the score is modified.

The FICO credit score is mostly created by software from Fair Isaac Corporation. It provides a guideline for predicting future behaviour from credit reports. In practice, such reports should contain information on at least one bank account covering at least six months. This information helps lenders to make loan decisions such as the interest rate, the period of the loan and credit approval. However, it cannot be guaranteed to distinguish good customers from bad ones.

In addition to its benefits for lenders, the FICO score has several advantages for borrowers. The first is that borrowers can get loans faster because lenders can identify within a few minutes who has a score above the cut-off level. Secondly, credit is approved by a standard measurement without bias rather than by personal opinion. Finally, because credit reports are a snapshot of information, borrowers who have had problems with credit in the past can improve their FICO score through good current payments.

## Unexpected loss in a Basel II context

Basel II is a regulatory standard launched by the Basel Committee on Banking Supervision (BCBS), which is headquartered at the Bank for International Settlements (BIS) in Basel, Switzerland, to promote international monetary and financial operations for banks (Chorafas, 2004). Like Basel I, it sets rules for estimating the minimum regulatory capital required to ensure that banks are able to pay back depositors. In addition, it requires banks to consider their risk, which means holding sufficient capital to support three risk categories, namely market, credit and operational risk. The Basel committee developed the Internal Ratings Based (IRB) framework for credit risk, which incorporates Expected Loss (EL) and Unexpected Loss (UL) within the IRB approach (Altman et al, 2004). In practice, UL is usually identified as the standard deviation of the credit losses that financial institutions or banks should expect on a portfolio in a single year.

One of the greatest difficulties in risk management is identifying the appropriate amount of capital to cover unexpected loss, which occasionally occurs in banks and financial institutions without having been forecast. In particular, the estimation model is usually based on the Value at Risk (VaR) approach to measuring credit risk and operational risk. This method specifies the probability distribution of potential losses, including EL and UL, over a time horizon. The framework establishes the appropriate level of capital to cover unexpected loss, and this level is tied to a confidence level, meaning the probability that a financial institution will not go bankrupt or fail in some of its businesses. In practice, the confidence level cannot be 100% because losses estimated from historical data do not follow a perfect distribution. Therefore, the confidence level assigned to banks is close to perfect, such as 99.9%, 99% or 95%.
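The relationship between expected loss, unexpected loss and the confidence level can be illustrated with a small sketch. It is not the Basel II IRB formula: it simply takes a list of hypothetical scenario losses, treats the mean as EL, the standard deviation as UL, and the empirical quantile at the chosen confidence level as the VaR. The function name and the loss figures are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def loss_summary(losses, confidence=0.999):
    """Summarise a loss distribution given as a list of scenario losses."""
    ordered = sorted(losses)
    n = len(ordered)
    expected_loss = mean(ordered)
    loss_std = pstdev(ordered)  # UL as the standard deviation of losses
    # Empirical quantile: smallest loss with at least `confidence` of the
    # scenarios at or below it (the small tolerance guards against rounding).
    index = min(n - 1, max(0, math.ceil(confidence * n - 1e-9) - 1))
    value_at_risk = ordered[index]
    return {
        "expected_loss": expected_loss,
        "loss_std": loss_std,
        "var": value_at_risk,
        # Capital held against unexpected loss: the part of VaR above EL.
        "capital_for_unexpected_loss": value_at_risk - expected_loss,
    }


# Hypothetical yearly losses in 100 scenarios: mostly none, a few moderate,
# one severe.
scenario_losses = [0] * 90 + [10] * 9 + [100]
summary = loss_summary(scenario_losses, confidence=0.99)
print(summary)
```

Raising `confidence` towards 99.9% pushes the quantile into the tail of the distribution, which is why a higher confidence level requires more capital for the same portfolio.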

## Information Value of a Variable

Information is used to predict the credit score of customers in order to manage the risk of a loan. The value of the information has to be considered because it influences decision making. Accordingly, the information value (IV) of a variable is important for testing whether the variable has the power to create valuable information for credit scoring.

Variables are significant in every step of prediction because they are the inputs, or predictors, that produce the result. A variable is first divided into several groups, such as age groups of under 22 years, 23-40 years, 41-60 years and over 60 years. The information value is then estimated from these categories on the basis of the weight of evidence (WoE), which measures the risk of each category of a variable. For example, if age is a variable for scoring customers who ask to borrow money from a bank, young borrowers are graded low because they may have no regular income, which leads to a low ability to pay back. The WoE for young people is therefore very low, which means a higher risk than for other age groups. From another perspective, however, the bank can retain them as its own customers for a long time.

Based on historical data and the WoE of each group, the next step is to combine all groups with the information value formula in order to generate the information value of the age variable. This value is then interpreted as the predictive power of the variable. According to a rule of thumb, information values are categorised into four groups, unpredictive, weak, medium and strong, for predicting future payment (Baesens, 2010).

In addition to credit scoring in banks, this method can be adapted and combined with data mining in commercial companies to rate how able their customers are to purchase products and services. The results can then help to create promotions aimed at specific groups.
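The calculation can be sketched as follows. The essay does not reproduce the formula, so the sketch assumes the commonly used definitions: the WoE of a group is the natural logarithm of its share of goods divided by its share of bads, and the IV is the sum over groups of (share of goods minus share of bads) times WoE. The counts per age group are hypothetical, and the thresholds 0.02, 0.1 and 0.3 are the widely quoted rule-of-thumb boundaries, not values taken from the essay.

```python
import math


def woe_and_iv(groups):
    """Compute the WoE per group and the IV of the whole variable.

    `groups` maps a group label to (number of goods, number of bads).
    Every group is assumed to contain at least one good and one bad.
    """
    total_good = sum(good for good, _ in groups.values())
    total_bad = sum(bad for _, bad in groups.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in groups.items():
        dist_good = good / total_good
        dist_bad = bad / total_bad
        woe[label] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[label]
    return woe, iv


def predictive_power(iv):
    """Map an information value to the four rule-of-thumb categories."""
    if iv < 0.02:
        return "unpredictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    return "strong"


# Hypothetical counts of good and bad payers per age group.
age_groups = {
    "under 22": (50, 30),
    "23-40": (200, 40),
    "41-60": (180, 20),
    "over 60": (70, 10),
}
woe, iv = woe_and_iv(age_groups)
for label, value in woe.items():
    print(f"{label}: WoE = {value:.3f}")
print(f"IV = {iv:.3f} ({predictive_power(iv)})")
```

In this made-up example the youngest group has a negative WoE, which matches the reasoning above that young borrowers are graded as higher risk.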

## AUC based pruning

AUC, which stands for the area under the ROC curve, is a scalar measure of the performance of a constructed classifier. The ROC curve is the receiver operating characteristic curve, and it illustrates the relationship between the sensitivity and specificity of a classifier.

The AUC can be reported with and without pruning. In practice, the result with pruning is usually better (Ferri et al., 2002). For example, if researchers use decision trees for classification and have several possible alternatives at the leaves, the AUC is used to estimate the quality of each classifier, and a higher value is better than a lower one. In particular, the AUC should be above 0.5, which indicates a classifier that performs better than random. With a decision tree, it is difficult to know when to stop growing the tree, so the tree should be pruned. The training set is usually used to grow the tree, and a validation sample is then used to determine its optimal size. If the tree grows larger than this, it overfits the single data set. In the above example, the AUC based on the pruned decision trees should therefore be better. In addition, the AUC gives an accurate estimate of the quality of each classifier.

In the procedure of input selection (Baesens, 2010), AUC based pruning is one step for classification. The method starts from the AUC of a model with all variables, obtained from a technique such as logistic regression, in order to measure the performance of the variables. The variable whose removal gives the highest AUC is then dropped. The process repeats until the AUC drops considerably. The variables that remain just before this significant decrease in the graph are suitable for developing the scorecard. This method supports the selection of appropriate variables for estimation in credit scoring and data mining. However, it is time consuming because the AUC has to be calculated many times to plot the graph before the optimal result is reached.
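The selection loop can be sketched as a backward elimination. This is a minimal illustration under assumptions, not the procedure as given by Baesens: `evaluate_auc` is a hypothetical callable standing in for "refit the model on these variables and return its AUC", the `max_drop` tolerance is an assumed stopping rule for "drops considerably", and the toy contributions attached to the variable names are invented.

```python
def auc_based_pruning(variables, evaluate_auc, max_drop=0.01):
    """Remove variables one at a time while the AUC stays close to the
    AUC of the full model.

    In each round, the variable whose removal leaves the highest AUC is
    dropped. The loop stops when even the best removal would push the AUC
    more than `max_drop` below the full-model AUC.
    """
    selected = list(variables)
    baseline = evaluate_auc(selected)
    history = [(tuple(selected), baseline)]
    while len(selected) > 1:
        candidates = [
            (evaluate_auc([v for v in selected if v != dropped]), dropped)
            for dropped in selected
        ]
        best_auc, dropped = max(candidates)
        if baseline - best_auc > max_drop:
            break
        selected.remove(dropped)
        history.append((tuple(selected), best_auc))
    return selected, history


# Toy stand-in for model fitting: each variable adds a fixed, invented
# amount to the AUC of a random classifier (0.5).
contribution = {"negPlanChange": 0.10, "posTrend": 0.05,
                "birthDate": 0.002, "prom": 0.001}


def toy_auc(subset):
    return 0.5 + sum(contribution[v] for v in subset)


selected, history = auc_based_pruning(list(contribution), toy_auc)
print(selected)
for subset, auc in history:
    print(len(subset), round(auc, 3))
```

Plotting the AUC values in `history` against the number of remaining variables gives the graph described above. The repeated calls to `evaluate_auc` are what make the method time consuming on a real model.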

## Part 4

Several methods are available for measuring classifiers in order to select the best one. This section illustrates these methods, namely the confusion matrix, the Kolmogorov-Smirnov distance, the ROC curve, the AUC and the CAP curve.

The confusion matrix presents the result of classification based on a cut-off of 175 as follows:

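| Predicted | Actual non-defaulter | Actual defaulter |
| --- | --- | --- |
| Non-defaulter | TP (13) | FP (4) |
| Defaulter | FN (5) | TN (8) |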

Table 3- Confusion matrix for cut-off of 175

Classification accuracy = (TP + TN) / (TP+FP+TN+FN) = 70%

Error rate = (FP + FN) / (TP+FP+TN+FN) = 30%

Sensitivity = TP / (TP + FN) = 0.722

Specificity = TN / (TN + FP) = 0.667
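The four measures can be checked against the counts in Table 3 with a short sketch. The function name is illustrative and the counts are those given in the confusion matrix above.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the standard measures from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


metrics = confusion_metrics(tp=13, fp=4, fn=5, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With 30 observations in total, the accuracy is 21/30 and the error rate is 9/30, so the two always sum to one.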

The sensitivity and specificity of each possible cut-off are generated with Microsoft Excel, as shown in Table 4.

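| Cut-off | Sensitivity | Specificity |
| --- | --- | --- |
| | 1 | 0 |
| | 1 | 0.083 |
| | 1 | 0.167 |
| | 1 | 0.250 |
| | 1 | 0.333 |
| | 0.944 | 0.333 |
| | 0.944 | 0.417 |
| | 0.944 | 0.500 |
| | 0.889 | 0.500 |
| | 0.889 | 0.583 |
| 145 | 0.889 | 0.667 |
| | 0.833 | 0.667 |
| | 0.778 | 0.667 |
| 175 | 0.722 | 0.667 |
| | 0.667 | 0.667 |
| 195 | 0.667 | 0.750 |
| | 0.611 | 0.750 |
| | 0.556 | 0.750 |
| | 0.556 | 0.833 |
| | 0.500 | 0.833 |
| | 0.444 | 0.833 |
| | 0.444 | 0.917 |
| | 0.389 | 0.917 |
| | 0.333 | 0.917 |
| | 0.278 | 0.917 |
| | 0.222 | 0.917 |
| | 0.167 | 0.917 |
| | 0.111 | 0.917 |
| | 0.111 | 0.917 |
| | 0.056 | 0.917 |
| | 0 | 0.917 |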

Table 4- Sensitivity and Specificity of selected cut-off

Based on the results for the different cut-offs, the Kolmogorov-Smirnov curve can be generated, as shown in the graph below.

Figure 6- Kolmogorov-Smirnov curve of selected cut-off

From the above graph, the Kolmogorov-Smirnov distance is calculated for each selected cut-off, as shown in Table 5. The maximum KS distance is 0.556, at the cut-off of 145.

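| Cut-off | Cumulative Goods (1 − sensitivity) | Cumulative Bads (specificity) | KS distance |
| --- | --- | --- | --- |
| | 0 | 0 | 0 |
| | 0 | 0.083 | 0.083 |
| | 0 | 0.167 | 0.167 |
| | 0 | 0.250 | 0.250 |
| | 0 | 0.333 | 0.333 |
| | 0.056 | 0.333 | 0.278 |
| | 0.056 | 0.417 | 0.361 |
| | 0.056 | 0.500 | 0.444 |
| | 0.111 | 0.500 | 0.389 |
| | 0.111 | 0.583 | 0.472 |
| 145 | 0.111 | 0.667 | 0.556 |
| | 0.167 | 0.667 | 0.500 |
| | 0.222 | 0.667 | 0.445 |
| 175 | 0.278 | 0.667 | 0.389 |
| | 0.333 | 0.667 | 0.333 |
| 195 | 0.333 | 0.750 | 0.417 |
| | 0.389 | 0.750 | 0.361 |
| | 0.444 | 0.750 | 0.306 |
| | 0.444 | 0.833 | 0.389 |
| | 0.500 | 0.833 | 0.333 |
| | 0.556 | 0.833 | 0.278 |
| | 0.556 | 0.917 | 0.361 |
| | 0.611 | 0.917 | 0.306 |
| | 0.667 | 0.917 | 0.250 |
| | 0.722 | 0.917 | 0.195 |
| | 0.778 | 0.917 | 0.139 |
| | 0.833 | 0.917 | 0.083 |
| | 0.889 | 0.917 | 0.028 |
| | 0.889 | 1 | 0.111 |
| | 0.944 | 1 | 0.056 |
| | 1 | 1 | 0 |

Table 5- Kolmogorov-Smirnov distance of selected cut-off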
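The KS distance at a cut-off can be reproduced from sensitivity and specificity alone. The sketch below assumes that Good is the positive class, so that 1 − sensitivity is the cumulative share of Goods scoring below the cut-off and specificity is the cumulative share of Bads scoring below it. Only the three cut-offs that can be matched to figures quoted in the text (145, 175 and 195) are used, and that matching of rows to cut-offs is itself an assumption. The function name is illustrative.

```python
def ks_distance(sensitivity, specificity):
    """Distance between the cumulative Bad and Good distributions at a cut-off."""
    cumulative_good = 1.0 - sensitivity  # Goods scoring below the cut-off
    cumulative_bad = specificity         # Bads scoring below the cut-off
    return abs(cumulative_bad - cumulative_good)


# (sensitivity, specificity) per cut-off, as read from the tables above.
rows = {
    145: (0.889, 0.667),
    175: (0.722, 0.667),
    195: (0.667, 0.750),
}
ks = {cutoff: round(ks_distance(sens, spec), 3)
      for cutoff, (sens, spec) in rows.items()}
best_cutoff = max(ks, key=ks.get)
print(ks)
print("largest KS distance at cut-off", best_cutoff)
```

Among these three cut-offs the largest distance occurs at 145, in line with the maximum reported above.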

The Receiver Operating Characteristic (ROC) curves are based on the selected cut-offs. The scorecard performs better than a random scorecard because its curve lies nearer to the top-left corner than the random line. Moreover, the results from SPSS for the selected cut-offs show that the cut-off of 145 leads to the highest AUC, which is 0.778. Because this is well above 0.5, the cut-off gives a good classifier.

Figure 7- ROC curves of selected cut-off and comparing with a random scorecard
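For a single cut-off, the ROC curve consists of three points: (0, 0), (1 − specificity, sensitivity) and (1, 1). The area under it can be computed with the trapezoidal rule, as in the sketch below. The sensitivity and specificity used for the cut-off of 145 (0.889 and 0.667) are read from Table 4 and are an assumption about which row belongs to that cut-off; the essay obtained its figure from SPSS, not from this code.

```python
def single_cutoff_auc(sensitivity, specificity):
    """Area under the three-point ROC curve of one cut-off (trapezoidal rule)."""
    points = [(0.0, 0.0), (1.0 - specificity, sensitivity), (1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


auc_145 = single_cutoff_auc(sensitivity=0.889, specificity=0.667)
print(round(auc_145, 3))
```

For a three-point curve the area simplifies to (sensitivity + specificity) / 2, so the cut-off with the highest AUC is also the one with the largest KS distance.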

The Cumulative Accuracy Profile (CAP) curve is drawn from the percentage of Bads captured at each cut-off and is compared with a random scorecard. The accuracy ratio (AR), or Gini coefficient, is then calculated by the formula AR = 2*AUC-1. For the cut-off of 145, the AR is 0.556.

Figure 8- CAP curves of the scorecard and a random scorecard
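The formula AR = 2*AUC-1 can be applied directly to the AUC values reported in this essay. The sketch below is only a check of that arithmetic; the function name and the labels are illustrative.

```python
def accuracy_ratio(auc):
    """Accuracy ratio (Gini coefficient) from the area under the ROC curve."""
    return 2.0 * auc - 1.0


reported_auc = {
    "scorecard, cut-off 145": 0.778,
    "logistic regression, test set": 0.675,
    "decision tree, test set": 0.85,
}
ratios = {label: round(accuracy_ratio(auc), 3)
          for label, auc in reported_auc.items()}
for label, ar in ratios.items():
    print(f"{label}: AR = {ar}")
```

A random scorecard has an AUC of 0.5 and therefore an AR of 0, while a perfect one has an AUC of 1 and an AR of 1.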

Each cut-off acts as a classifier that predicts Good or Bad. If the score is below the cut-off, the customer is predicted as Bad; otherwise the customer is rated as Good. Comparing the actual and predicted classes gives the classification accuracy, error rate, sensitivity and specificity. Each cut-off produces different values of these measures, and the results can be used to plot the KS distance and the ROC graph, which show the performance of the classifier. The cut-off with the highest KS distance is the best classifier for this data set. Similarly, the best cut-off can be chosen by comparing the ROC curves of several selected cut-offs, with the performance of each classifier summarised by its AUC. According to KS, ROC and AUC, the best cut-off is 145. After the model has been developed, the CAP curve can be used to recheck its accuracy in separating defaulters from the total population.