# Introduction To The Mega Trend Diffuser Accounting Essay



Governments and businesses may seek decision support systems (DSS) to assist with setting strategies or making business-related decisions. The naive Bayesian classifier (NBC) provides an efficient and easily understood classification model for the data mining and machine learning needed to create a DSS. The NBC is one way a company (or government) can better understand its customers (or people) through gathered data. In practice, however, gathering sufficient data for constructing a DSS may be costly, and insufficient data results in unreliable understanding, poor decisions and large losses. Furthermore, the NBC is not without its drawbacks: (i) classification accuracy decreases when attributes are not independent, and (ii) it is unable to handle non-parametric continuous attributes. In this dissertation, the mega-trend-diffusion (MTD) technique is introduced to address the cost of gathering a sufficient sample, while proposed remedies for the problems of the NBC, namely structural improvement and discretisation, are also explored. The goal is to look into the practicality of Bayesian statistics in the real world; more specifically, how a Bayesian approach can contribute to better decision making in the marketing context.

Chapter 1

Introduction

Over the past twelve years Bayesian statistics has seen increasing use for marketing purposes. The versatility of the Bayesian method has seen it applied across a wide range of the marketing mix, from pricing decisions to promotional campaigns to diversification decisions. Although the concepts of Bayesian statistics have been recognised since their emergence in 1763, the approach was deemed impractical in the marketing context until the mid-1980s. This is because the 'class of models for which the posterior could be computed were no larger than the class of models for which exact sampling results were available' - Rossi and Allenby [2003]. Furthermore, the Bayesian approach requires the evaluation of a prior distribution, which can take a lot of effort, time and cost to establish. However, the recent decade has brought an increase in computational power and modelling breakthroughs, which led to a resurgence in the use of Bayesian methods, particularly in the marketing field. With their basic advantages, Bayesian methods are an attractive tool, especially for facilitating decision making in marketing problems.

In this dissertation, we aim to study a classification method based on the Bayesian approach, which provides a logical way of taking into account the available information about the sub-samples associated with the different classes within the overall sample. More specifically, we will look into a simple probabilistic classifier based on Bayes' theorem with a strong independence assumption, called the naive Bayesian classifier. This independent-feature model has seen wide application in marketing, such as the procedure for analysing customer characteristics in the paper by Green [1964]. We will discuss in later chapters how this classifier is constructed and how it can be applied in a marketing context for a pricing decision.


Firms and governments seek decision support systems (DSS) to assist in making decisions and setting strategies. To establish an effective DSS, also known as business intelligence, data mining techniques and machine learning are required as core functions. However, in rapidly shifting markets and with an ever more demanding population, decision makers are pressured to make decisions within very short periods of time. Classifiers in the machine learning process can only function well when sufficient data is collected for training. This leads to the problem of collecting adequate sample sets for an informative DSS. To address the extra cost commonly attributed to assessing a prior distribution in the Bayesian method, this dissertation also introduces and studies a technique proposed by Li, Wu, Tsai, and Lina [2007a] called the mega-trend diffusion (MTD) technique. Applied to small collected data sets, this technique extracts more useful information that can then be used for classification learning. Related research that uses the MTD technique for classification learning has shown promising results regarding classification performance. We hope to learn more about this technique and how it can be used with the NBC in this dissertation.

The methods studied in this paper are demonstrated using a case involving a decision to build additional daycare centres in districts in the UK. This dissertation begins by introducing the naive Bayesian classifier in Chapter 2, where a simple naive Bayes classifier example is presented as a segue to the formal definition of the NBC and the problems related to the classifier. Chapter 3 introduces the mega-trend diffuser technique to address the issue of insufficient sample data, and the experimental study using the proposed methods on the case of building additional daycare centres is described in Chapter 4. Lastly, concluding remarks are given in Chapter 5.


Chapter 2

Naive Bayesian Classifier

2.1 What are classifiers?

By definition, a class is a set or category of members that have properties or characteristics in common and are differentiated from others by these characteristics.

A classifier can be some set of rules or an algorithm that implements classification. It is a function that maps data (which exhibit attributes) to categories (or classes). A few examples of the myriad classification methods include: simple linear classifiers, logistic regression, naive Bayes classifiers, nearest neighbour classifiers, decision tree classifiers, etc.

Classifiers are extensively used for data mining, with applications in credit approval, fraud detection, medical diagnosis, marketing (which we will see in this paper), etc. For example, classifiers are used to help identify characteristics of profitable and unprofitable product lines [Amat, 2002], in product development decisions [Nadkarni and Shenoy, 2001] and in conglomerate diversification decisions [Li et al., 2009a].

A graphical interpretation

Here is a simple explanation of the terminologies and the idea of classification through these three points: profiling, segmentation and classification [Amat, 2002, Section 3].


• Profiling involves the identification of attributes pertaining to the category (or class).

Attributes:
1) No. of Wheels: 4
2) Color: Blue
3) Windows: Yes
...

Figure 2.1: Profiling.

• Segmentation is the process of identifying classes in given data (Figure 2.2). In this example we are segmenting vehicles into categories; we map {Bicycle, Tricycle, Car} → {1,2,3} for ease of computation.

Figure 2.2: Segmentation.

• Classification is the process of assigning a given (new) input sample to one of the existing categories. A classifier is a function that does this. A crucial point is that classification assumes a segmentation already exists through a learning algorithm¹. Here (Figure 2.3), a new data entry of a red bicycle with its key attribute of "2 wheels" is classified to Category 1.

¹ See Dietterich [1998] for more on learning algorithms.

Figure 2.3: Classification.

2.2 Bayesian statistics (a casual introduction)

Reverend Thomas Bayes (ca. 1702-1761) of London, England was a mathematician and Presbyterian minister whose legacy lives on in the renowned theorem that bears his name: Bayes' theorem. It is believed to have stemmed from the probability paper "Essay Towards Solving a Problem in the Doctrine of Chances", published in 1763 by Richard Price [Marin and Robert, 2007]. Today, Bayes' theorem serves as a fundamental element of statistical inference and statistical modelling, with applications ranging from market analysis to genetic research.

Essentially, Bayesian statistics provides a rational method of updating beliefs (probabilities) in light of new evidence (observations). It coherently combines information from different sources using conditional probabilities through Bayes' rule:

$$P(C_j \mid A) = \frac{P(A \cap C_j)}{P(A)} \tag{2.1}$$

$$= \frac{P(C_j)\,P(A \mid C_j)}{\sum_{i=1}^{n} P(A \mid C_i)\,P(C_i)} \tag{2.2}$$

Here we have the posterior distribution P(Cj|A), the prior distribution P(Cj) and the likelihood P(A|Cj). Note that the denominator P(A) is invariant across classes (i.e. it is constant and does not depend on Cj), hence it can be dropped as it does not affect the end result of classification. Therefore:

$$P(C_j \mid A) \propto P(C_j)\,P(A \mid C_j) \tag{2.3}$$

Say Cj here refers to Class j (or category j) and A refers to the set of attributes related to the class (i.e. it can refer to a vector of attributes A = {A1,...,Ak}). Then Bayes' rule allows us to update our initial beliefs (i.e. the prior P(Cj)) about Class j by combining them with new information we gathered about the attributes A that are related to Class j. This results in the new belief about Class j, expressed through the posterior P(Cj|A).
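The update from prior to posterior can be illustrated numerically. The following is a minimal Python sketch of (2.1) to (2.3) for two classes; the function name `posterior`, the class labels and all probabilities are hypothetical values chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(C_j | A) for every class j, following (2.1)-(2.2)."""
    # Numerator of (2.2): P(C_j) * P(A | C_j), i.e. the right-hand side of (2.3).
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # Denominator of (2.2): P(A) = sum_i P(A | C_i) * P(C_i).
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers, chosen only for illustration.
priors = {"high": 0.3, "low": 0.7}        # P(C_j)
likelihoods = {"high": 0.8, "low": 0.2}   # P(A | C_j)

post = posterior(priors, likelihoods)
best = max(post, key=post.get)            # class with the largest posterior
print(post, best)
```

Because P(A) is the same for every class, ranking the classes by the unnormalised products in (2.3) selects the same class as ranking by the normalised posteriors.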

2.3 Why Naive?

With the assumption that the features (attributes) are independent given the class, an NBC greatly simplifies the process of learning. Here, it is explained very simply. Let C be a binary random variable where

$$C = \begin{cases} 1 & \text{Class 1} \\ 0 & \text{Class 2} \end{cases}$$

and A1,...,Ak be the set of predictor variables (i.e. attributes). The key point here is the simplistic (naive) assumption that the predictors are conditionally independent given C, such that the joint conditional probability can be written as

$$P(A_1, \ldots, A_k \mid C) = \prod_{i=1}^{k} P(A_i \mid C)$$

Combining this with Bayes' theorem (2.3) leads to the naive Bayesian classifier (here presented as odds)

$$\frac{P(C = 1 \mid A_1, \ldots, A_k)}{P(C = 0 \mid A_1, \ldots, A_k)} = \frac{P(C = 1)}{P(C = 0)} \prod_{i=1}^{k} \frac{P(a_i \mid C = 1)}{P(a_i \mid C = 0)}$$

Taking the log gives

$$\log \frac{P(C = 1 \mid A_1, \ldots, A_k)}{P(C = 0 \mid A_1, \ldots, A_k)} = \log \frac{P(C = 1)}{P(C = 0)} + \sum_{i=1}^{k} \log \frac{f(a_i \mid C = 1)}{f(a_i \mid C = 0)}$$

where f(ai|C) is the conditional density of Ai. Figure 2.4 visualises the structure of the NBC.
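The log-odds form lends itself to a short computation: each attribute contributes one additive term. The Python sketch below assumes the conditional densities have already been evaluated at the observed attribute values; the function name `log_odds` and all numbers are hypothetical.

```python
from math import log


def log_odds(prior_1, prior_0, lik_1, lik_0):
    """Log posterior odds of C = 1 against C = 0 under the naive assumption.

    lik_1[i] and lik_0[i] hold f(a_i | C = 1) and f(a_i | C = 0).
    """
    return log(prior_1 / prior_0) + sum(
        log(f1 / f0) for f1, f0 in zip(lik_1, lik_0)
    )


# Hypothetical values for two attributes and equal priors.
score = log_odds(0.5, 0.5, [0.8, 0.6], [0.2, 0.3])
predicted = 1 if score > 0 else 0   # positive log-odds favour Class 1
print(score, predicted)
```

Working with sums of logs rather than products of probabilities also avoids numerical underflow when the number of attributes k is large.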

Figure 2.4: Structure of Naive Bayes Classifier (a Class node with an arc to each of the attribute nodes A1, A2, A3, ..., Ak). Notice the absence of arcs between attribute nodes, signifying independence given the class.

Naive Bayes classifiers are probabilistic, meaning they choose the class with the highest probability given the (new) data. Equivalently, the NBC selects the most likely class Classk,D, given the attributes a1, ..., ak, that best matches the training set D²:

$$\text{Class}_{k,D} = \arg\max_{c \in \text{Class}} P(c) \prod_{i=1}^{k} P(a_i \mid c) \tag{2.4}$$

P(C = c) and P(Ai = ai|C = c) need to be estimated from training data³. The conditional (probability) densities can be estimated separately by means of non-parametric univariate density estimation; in this paper, m-estimation [Cussens, 1993; Cestnik, 1990] is implemented in the example below. This avoids joint density estimation, which is highly undesirable, especially when the model has a large number of predictors. Moreover, the fact that the densities can be estimated non-parametrically allows the NBC flexible and unrestricted modelling of the relationship between the attributes (Ai) and the class (C) [Larsen, 2005].

The likelihood P(ai|c) is generally obtained by Bayesian estimation. Here m-estimation [Cussens, 1993; Cestnik, 1990] is used to estimate this probability, where the prior is constrained to a beta distribution with m governing the spread (variance) of the distribution.

$$P(a_i \mid c) = \frac{k_c + m \cdot P(a_i)}{n + m} \tag{2.5}$$

where

n := number of training samples where Class = c

kc := number of training samples where Class = c and a = ai

P(ai) := a priori estimate of P(ai|C = c)

m := an equivalent sample size (see Cussens [1993] for more on the m-estimate)

Here, the posterior distribution is then a beta distribution with the new variance parameter of n + m [Cussens, 1993].

² i.e. supervised learning [ref. 10] with data D such that a segmentation (Section 2.1) exists.

³ A set of data with known classes, which is normally provided.
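The m-estimate (2.5) is simple enough to check by hand. The Python sketch below evaluates it for illustrative counts n = 5 and kc = 2; the prior estimate P(ai) = 0.5 and the equivalent sample size m = 3 are assumptions made only for this illustration, and the function name `m_estimate` is hypothetical.

```python
def m_estimate(k_c, n, prior, m):
    """m-estimate of P(a_i | c) as in (2.5): (k_c + m * prior) / (n + m)."""
    if n + m <= 0:
        raise ValueError("n + m must be positive")
    return (k_c + m * prior) / (n + m)


# Illustrative counts; prior and m are assumed values.
p = m_estimate(k_c=2, n=5, prior=0.5, m=3)       # (2 + 1.5) / 8
p_raw = m_estimate(k_c=2, n=5, prior=0.5, m=0)   # plain relative frequency 2/5
p_empty = m_estimate(k_c=0, n=0, prior=0.5, m=3) # no data: falls back to the prior
print(p, p_raw, p_empty)
```

The estimate moves from the prior towards the observed relative frequency as n grows relative to m, and it never returns a zero probability for an attribute value that happens to be absent from a class.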

2.3.1 A simple example

Say, for example, a motor insurance company is deciding whether to charge more or less premium to a client owning a particular Gray BMW MPV car; based on the class of the car possibly being stolen or not. (Note that the data set does not contain such a car.)

i) The data set

| Brand (Attribute 1) | Color (Attribute 2) | Type (Attribute 3) | Class |
|---|---|---|---|
| Toyota | Gray | Sports | |
| Toyota | Blue | Sports | |
| Toyota | Blue | MPV | |
| BMW | Gray | Sports | |
| BMW | Gray | Sports | |
| BMW | Gray | Sports | |
| Toyota | Gray | MPV | |
| Toyota | Blue | MPV | |
| BMW | Blue | MPV | |
| BMW | Blue | Sports | |

Table 2.1: Example training data set for NBC model. [Meisner, 2003]

ii) Using m-estimation

To calculate the posterior (2.4) we need the conditional probabilities of the attributes Gray, BMW and MPV, each conditioned on the two classes, that is P(Gray|Yes), P(BMW|Yes), P(MPV|Yes), P(Gray|No), P(BMW|No) and P(MPV|No). Then, by Bayes' rule, these conditional probabilities are multiplied with P(Yes) and P(No) respectively to obtain the posterior probabilities.

From the data set, there are 5 entries where cj = Yes; and in 2 of these entries a1 = BMW. So for P (BMW|Yes) this implies n = 5 and kc = 2. We construct the following table for easy referencing:

2.4 Problems reaching an optimal NBC.

Despite the advantages of being fast to train and classify, robust to noisy data [Yang and Webb, 2002] and insensitive to irrelevant attributes, the NBC has its drawbacks: (i) an inability to handle non-parametric continuous attributes, and (ii) a decrease in classification accuracy when attributes are not independent.

In their proceedings paper, Martinez-Arroyo and Sucar [2006] proposed two methods which deal with these two issues. These methods, namely discretisation and structural improvement, reduce classification error and lead towards an optimal NBC. In this paper, some of these methods are implemented later and combined with the mega-trend diffuser in the data analysis.

2.4.1 Discretisation

An attribute can take either a categorical (Yes or No) or a numerical (3.456, etc.) value. For a categorical attribute, the values are discrete and hence can be used to train classifiers without any modification. On the other hand, values of numerical attributes can be either discrete or continuous [Yang and Webb, 2002; Johnson, 2009]. Numerical attributes are converted, or discretised, to categorical ones irrespective of whether they are discrete or continuous. This helps improve classification performance, as preprocessing by discretisation removes the need to assume that numeric attributes follow a normal distribution. That is, for each numerical attribute Ai, a categorical attribute A*i is created, with each value of A*i corresponding to an interval (xi, yi] of Ai. Then A*i is used for training the classifier instead of Ai.
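The mapping from a numerical attribute to its categorical counterpart can be sketched in a few lines of Python. The cut points below are hypothetical and supplied by hand; choosing them well is the job of the discretisation method itself and is not shown here.

```python
from bisect import bisect_left


def discretise(value, cut_points):
    """Map a numeric value to the index of its right-closed interval.

    With sorted cut points c0 < c1 < ..., the intervals are
    (-inf, c0], (c0, c1], ..., (c_last, +inf), indexed from 0.
    """
    return bisect_left(cut_points, value)


cuts = [10.0, 20.0]   # hypothetical cut points
labels = [discretise(v, cuts) for v in [3.456, 10.0, 10.1, 25.0]]
print(labels)
```

A value that falls exactly on a cut point is assigned to the interval below it, which matches the right-closed form (xi, yi] used in the text.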

Information which is subject to continuous measurement can be placed into discrete classes for convenience [Green, 1964]. When attributes are discretised, the NBC is found on average to slightly outperform other classification algorithms such as C4.5⁴ [Dougherty et al., 1995]. This is because discretised attributes are formed so as to maximise class prediction. Hence, a simpler classification model can be achieved and irrelevant attributes become apparent (though the NBC still performs well with irrelevant attributes, their unnecessary presence only takes up a small amount of computational power).

A comparative study of a number of discretisation methods for the NBC can be found in Yang and Webb [2002], where nine discretisation methods are compared, such as fuzzy discretisation (FD), lazy discretisation (LD) and weighted proportional k-interval discretisation (WPKID). Yang and Webb [2002] also propose a new discretisation method called weighted non-disjoint discretisation (WNDD), which combines other discretisation methods and reduces classification error.

⁴ A statistical classifier that generates decision trees for classification. C5.0 is an improvement on C4.5, but it is commercial and has not yet been extensively used or compared against in research, unlike C4.5.

However, Yang and Webb [2002, Section 3] argue that discretisation methods of pure intervals might not be suitable for the NBC, because the NBC already assumes conditional independence between attributes and hence does not use attribute combinations as predictors. It is suggested that the categorical A*i be substituted for the numerical Ai so as to give an accurate estimate of P(C = c|A = {a1, ..., ak}). Keeping this in mind, this dissertation adopts the discretisation method based on the Minimum Description Length (MDL) principle, as suggested and used by Martinez-Arroyo and Sucar [2006] in their paper. The MDL principle will be employed to preprocess the data used to construct a more optimal NBC. A brief introduction to the MDL principle is given here, whereas its usage in this paper can be found in the later chapters.

A simple explanation of the MDL principle can be taken from this quote: '[The MDL Principle] is based on the following insight: any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally.' - Grünwald [2005]. Introduced in 1978 by Jorma Rissanen, the MDL principle is a powerful method of inductive inference and an important concept in information theory, pattern classification and machine learning. It is especially applicable to prediction and estimation problems, particularly in situations where the models considered may be complex enough that overfitting of the data is a concern; we will therefore employ it as the basis for the discretisation method used in this research.

However, this paper will not go into great detail about the MDL principle, as it forms a topic of its own. Readers who are interested in a comprehensive derivation of this method can find it in the article by Grünwald [2005]. In essence, we learn the discretisation of the data using a score based on the MDL principle, and use the result for building our NBC. Note that from now on, MDLP is synonymous with MDL principle.

In R, the package "discretization" contains a function that implements the MDLP; it can be called using the following command lines:

> install.packages("discretization")

> library(discretization)

> mdlp(data)

where data is the dataset matrix to be discretised.

2.4.2 Structural Improvement

Although structural improvement will not be used to modify the NBC in the data analysis section of this paper, for reasons that will become apparent later, it is highlighted briefly here to introduce the reader to the possibility of constructing a classifier which deals with dependent attributes.

As highlighted earlier in this chapter, the NBC assumes the attributes to be independent of each other given the class. In reality, this may not always be true. There are two workarounds to this, as Martinez-Arroyo and Sucar [2006] point out. The first alternative is to connect the dependent attributes with directed arcs. This leads to an extension of the NBC, the Bayesian network classifier (BNC) [Baesens et al., 2004]. Ong, Khoo, and Saw [2012] proposed a similar method of structural learning using a hill-climbing algorithm, which was used to identify dependencies and causal relationships between variables. They also included results from accuracy tests of the modified classifier. Figure 2.5 is an example that visualises this workaround.

Figure 2.5: Modified naive Bayes classifier with an arc introduced between attributes A2 and A3 to signify dependency or causal relationship.

However, a disadvantage is that this results in the loss of the simple modelling of an NBC, in exchange for a more complicated BNC. The second method is to transform the structure while still maintaining an NBC-structured network. Three basic operations can be used here [Martinez-Arroyo and Sucar, 2006; Succar, 1991]: 1) attribute elimination, 2) merging two or more dependent attributes, 3) introducing an additional attribute that makes two dependent attributes independent (as a hidden node). These operations (except 3) are illustrated in Figures 2.6 and 2.7.

Figure 2.6: NBC with attribute A3 eliminated (Class node with arcs to A1, A2, ..., Ak−1).

Figure 2.7: NBC with two attributes combined into one variable (Class node with arcs to the merged node A1,A2 and to A3, ..., Ak).

In this alternative, superfluous attributes are eliminated if the mutual information between the attribute and the class falls below a threshold, since such attributes provide no additional information about the class. The remaining attributes are then examined, pair by pair, for conditional mutual information (CMI) [Fleuret, 2004] given the class. A high CMI value implies dependency, and one of the two attributes is then eliminated, or the pair is merged to form a single attribute [Martinez-Arroyo and Sucar, 2006]. This is repeated until no unnecessary or dependent attributes remain.

These two methods can be used after the preprocessing by discretisation. In this paper, the dataset obtained is assumed to have independent attributes and hence will not require structural improvement of the NBC used.
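The two quantities used for elimination and merging can be estimated from relative frequencies. The Python sketch below is one straightforward plug-in estimate for discrete (already discretised) attributes; the function names and the toy data are hypothetical, and the thresholds that would drive elimination or merging are not shown.

```python
from collections import Counter
from math import log


def mutual_information(xs, cs):
    """I(X; C) in nats, estimated from paired samples by relative frequencies."""
    n = len(xs)
    px, pc, pxc = Counter(xs), Counter(cs), Counter(zip(xs, cs))
    return sum((nxc / n) * log(nxc * n / (px[x] * pc[c]))
               for (x, c), nxc in pxc.items())


def conditional_mutual_information(xs, ys, cs):
    """I(X; Y | C) = sum over c of P(c) * I(X; Y | C = c)."""
    n = len(cs)
    total = 0.0
    for c, n_c in Counter(cs).items():
        idx = [i for i in range(n) if cs[i] == c]
        total += (n_c / n) * mutual_information([xs[i] for i in idx],
                                                [ys[i] for i in idx])
    return total


# Toy data: a2 carries no information about the class, and a3 duplicates a2.
a2 = [0, 1, 0, 1, 0, 1, 0, 1]
a3 = list(a2)
cls = [0, 0, 0, 0, 1, 1, 1, 1]

mi_a2 = mutual_information(a2, cls)                 # candidate for elimination
cmi = conditional_mutual_information(a2, a3, cls)   # candidate pair for merging
print(mi_a2, cmi)
```

In this toy case the attribute a2 has zero mutual information with the class, while the pair (a2, a3) has a high CMI because the two attributes are copies of each other.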


2.5 Formal definition

In the previous sections and subsections, the naive Bayes classifier was informally introduced. In this section we formally define the NBC, as in Rish [2001].

Let A = (A1,...,An) be a vector of observed random variables, called attributes. Each attribute Ai takes values from the domain Di. Then the set of all attribute vectors is Ω = D1 × ... × Dn. Let the class be denoted by C, an unobserved random variable. C can take one of k values, i.e. c ∈ {0, ..., k − 1}. Let capital letters denote the variables, while lower-case letters denote the corresponding values (i.e. Ai = ai). Also, as per convention, bold letters denote vectors.

A classifier is defined as a function h(a) : Ω → {0, ..., k − 1} that assigns a class c to given data with attribute vector a, where h(a) = c. Commonly each class j is associated with a discriminant function fj(a), j = 0, ..., k − 1, and the classifier selects the class with the maximum discriminant function for a given sample, such that h(a) = arg max j∈{0,...,k−1} fj(a).

The deviation from a general classifier is that a Bayesian classifier uses the class posterior probabilities as the discriminant function, meaning that if we denote a Bayesian classifier as h*(a) then f*j(a) = P(C = j|A = a). Using Bayes' rule we get

$$P(C = j \mid A = a) = \frac{P(A = a \mid C = j)\,P(C = j)}{P(A = a)}$$

and as before, we can ignore P(A = a) as it is a constant, to give f*j(a) = P(A = a|C = j)P(C = j). This implies that the Bayesian classifier h*(a) = arg max j∈{0,...,k−1} P(A = a|C = j)P(C = j) looks for the maximum a posteriori probability hypothesis given the sample attributes a.

Unfortunately, as we delve into higher-dimensional feature spaces, directly estimating the class-conditional probability distribution P(A = a|C = j) from a given sample set becomes tedious or difficult. Thus, we assume for simplification that the features are independent given the class. This approximation of the Bayesian classifier yields the naive Bayesian classifier with discriminant function

$$f_j^{NB}(a) = P(C = j) \prod_{i=1}^{n} P(A_i = a_i \mid C = j)$$

Hence we have the naive Bayesian classifier that finds the maximum a posteriori probability given the sample attributes a, with the assumption that the attributes are independent given the class:

$$h^{NB}(a) = \arg\max_{j \in \{0, \ldots, k-1\}} P(C = j) \prod_{i=1}^{n} P(A_i = a_i \mid C = j) \tag{2.6}$$
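The formal definition translates directly into code. The following Python sketch implements the discriminant function and the arg max of (2.6) for categorical attributes, with conditional probabilities estimated by an m-estimate. The uniform prior estimate for each attribute value, the default m = 1, the function names and the toy vehicle data are all assumptions made for illustration; this is not the implementation used later in the dissertation.

```python
from collections import Counter, defaultdict


def train_nbc(rows, classes, m=1.0):
    """Collect the counts needed for P(C = j) and P(A_i = a_i | C = j)."""
    class_counts = Counter(classes)
    joint = defaultdict(Counter)   # (attribute index, class) -> value counts
    values = defaultdict(set)      # attribute index -> observed values
    for row, c in zip(rows, classes):
        for i, a in enumerate(row):
            joint[(i, c)][a] += 1
            values[i].add(a)
    return {"n": len(rows), "class_counts": class_counts,
            "joint": joint, "values": values, "m": m}


def discriminant(model, row, c):
    """f_j^NB(a) = P(C = j) * prod_i P(A_i = a_i | C = j), as in (2.6)."""
    n_c = model["class_counts"][c]
    score = n_c / model["n"]                       # P(C = j)
    for i, a in enumerate(row):
        prior = 1.0 / len(model["values"][i])      # assumed uniform prior estimate
        k_c = model["joint"][(i, c)][a]
        score *= (k_c + model["m"] * prior) / (n_c + model["m"])
    return score


def classify(model, row):
    """h^NB(a): the class with the largest discriminant value."""
    return max(model["class_counts"], key=lambda c: discriminant(model, row, c))


# Hypothetical toy data in the spirit of the vehicle example of Section 2.1.
rows = [("2 wheels", "no windows"), ("2 wheels", "no windows"),
        ("4 wheels", "windows"), ("4 wheels", "windows"),
        ("4 wheels", "no windows")]
classes = ["bicycle", "bicycle", "car", "car", "car"]

model = train_nbc(rows, classes)
pred_a = classify(model, ("2 wheels", "no windows"))
pred_b = classify(model, ("4 wheels", "windows"))
print(pred_a, pred_b)
```

Only one-dimensional counts per attribute and class are stored, which is the practical benefit of the independence assumption: no joint table over all attributes is ever needed.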


Chapter 3

Mega-Trend Diffuser

3.1 Introduction to the MTD

Insufficient data in business intelligence breeds uncertain knowledge which can lead to poor decision making and potentially high losses for a company or an institution. To make matters worse, collecting sufficient data can incur large expenses to a company in terms of time and money. In some cases, gaining enough real data is not always possible. Consequently the obtainable data may very often be incomplete as a training set for classifiers (or other models for that matter).

In this chapter, a method proposed by Li, Wu, Tsai, and Lina [2007a] called the mega-trend diffusion (MTD) technique is introduced as a tool to address the issue of insufficient data for training classification models. In our case, it is used for training the naive Bayes classifier. In Li et al. [2007a] the MTD is used to produce artificial samples from small datasets to aid the learning of a modified back-propagation neural network (BPNN) for an early manufacturing system. The results obtained by Li et al. [2007a] were promising, as the artificial sample set, in addition to the available training set, significantly improved the learning accuracy in the simulation model, even with a very small dataset.

Li, Lin, and Huang [2009a] also explored the construction of a marketing decision support system (DSS) using the MTD in a case study of gas station diversification in Taiwan. In their case study, the MTD is used in conjunction with the BPNN and a Bayesian network (BN). The MTD uncovered additional hidden data-related information which was not explicitly available from the dataset itself. This allowed the construction of a flexible and informative DSS given a small dataset, making it possible for marketing managers to have a better overview of the market and to find possible niche markets they can venture into while avoiding unprofitable ones.

Two setbacks of having small datasets are the gaps in sparse data and the difficulty of identifying a trend. The MTD addresses these issues by filling the gaps and estimating the trend in the data. It does this by assuming the artificial samples selected are located within a certain range and have possibility indexes that are calculated from a membership function (MF) [Li et al., 2009b]. Simply put, this method extracts more information from the available data by creating new relevant attributes. In this paper, we explore the possibility of integrating the MTD with the NBC to form a classification model with reduced estimation errors (hence better forecasting precision) for cases where the collected data is insufficient.

3.2 Constructing the MTD

Figure 3.1: The mega-trend diffusion (MTD) technique. (Membership function with membership value 1 at μset, drawn over the points a, min, μset, max, b.)

In the MTD technique, the expected population mean is assumed to be located between the minimum and maximum of the sample dataset. Then, it is natural to expect the true population mean to be located within a wider boundary (a, b), which is bigger than the boundaries [min, max] of the collected data, as shown in Figure 3.1. Consequently, a sufficient sample set would have data-points which are distributed within this larger range of (a, b). Parameters a and b are calculated using diffusion functions defined below.

3.2.1 Defining parameters a and b

In the first step of implementing the MTD, let min and max be defined as the minimum and maximum values of the dataset respectively. Then μset is the average of min and max. NL and NR refer to the number of data points smaller and larger than μset respectively, implying the dataset has NL + NR data points. Then SkewL and SkewR are the proportions of data points that are smaller than, and larger than or equal to, μset respectively. In the MTD technique, this is a simplified way of calculating the skewness of the data distribution; a simplified measure is used because the collected datasets are usually small and a sample regression line is probably unreliable [Li et al., 2007a]. We then have parameters a and b, the possible lower limit and the possible upper limit respectively, calculated as follows:
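For reference, a form of the diffusion functions commonly given in the MTD literature is sketched below. It is stated here as an assumption to be checked against Li et al. [2007a]; in particular, $\hat{s}_x^2$ (the sample variance of the dataset) is a symbol introduced for this sketch.

$$Skew_L = \frac{N_L}{N_L + N_R}, \qquad Skew_R = \frac{N_R}{N_L + N_R}$$

$$a = \begin{cases} \mu_{set} - Skew_L \sqrt{-2\,\dfrac{\hat{s}_x^2}{N_L}\,\ln\left(10^{-20}\right)} & \text{if this value} \le \min \\ \min & \text{otherwise} \end{cases}$$

$$b = \begin{cases} \mu_{set} + Skew_R \sqrt{-2\,\dfrac{\hat{s}_x^2}{N_R}\,\ln\left(10^{-20}\right)} & \text{if this value} \ge \max \\ \max & \text{otherwise} \end{cases}$$

Since $\ln(10^{-20})$ is negative, the quantity under each square root is positive, and the diffused range shrinks towards $\mu_{set}$ as $N_L$ and $N_R$ grow.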

Equations 3.1 and 3.2 are known as the diffusion functions for setting a and b, respectively. The natural logarithm ln(·) signifies the plausibility of the population mean being located within the boundary (a, b), while 10^−20 is a very small value reflecting how implausible it is for the mean to be located at the boundaries [Li et al., 2009a]. Note that using 1 would imply the most likely point, like μset, while 0 cannot be used, as ln(0) tends to negative infinity and would make the argument invalid. Another point to note is that as NL and NR tend to ∞ (i.e. implying that sufficient data has been collected), the value of a will be more than min while b will be less than max. Intuitively this is incorrect, since even with sufficient data we would expect the diffused range (a, b) to be at least as wide as [min, max]. Hence if this occurs, a and b are assigned to min and max respectively, as in equations 3.1 and 3.2.

3.2.2 Defining the membership function

Figure 3.2: The triangle membership function (peak value 1 at μset, drawn over the points a, Bi, μset, b). Here the membership value Mi w.r.t. the random number Bi reflects the possibility of Bi occurring.

The MTD technique assumes that the actual data is likely to be distributed in the (a, b) range. So the next step in the technique is to create artificial datasets by randomly sampling from this wider bounded range. This random sampling adds additional data, a random number Bi, to the collected dataset to increase the sample size. The random number is selected from a uniform distribution between a and b. A unique feature of the MTD technique is that a membership function (MF) is employed to make the additional data more useful. This MF produces the membership values that indicate the possibility of Bi being the true mean. In this paper, for simplicity we use a triangle-shaped function, as shown in Figure 3.1; however, the MF can be any other unimodal function that shows decreasing possibility [Li et al., 2009a]. Let the membership function Mi of the random number Bi be defined as:
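One triangular form consistent with Figure 3.2 is sketched below, assuming the membership value peaks at 1 at $\mu_{set}$ and falls linearly to 0 at a and b. The piecewise expression is an illustration under that assumption rather than a quotation of the original definition.

$$M_i = \begin{cases} \dfrac{B_i - a}{\mu_{set} - a} & a \le B_i \le \mu_{set} \\[2ex] \dfrac{b - B_i}{b - \mu_{set}} & \mu_{set} < B_i \le b \\[2ex] 0 & \text{otherwise} \end{cases}$$

For example, a random number halfway between a and $\mu_{set}$ receives a membership value of 0.5.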


A point to note is that each class j has its own MF for each attribute i. This is represented in Figure 3.3, where Mij is the MF of class j with respect to attribute i. This raises the question of overlap between the areas under the MFs of different classes with respect to the same attribute. This issue of overlap is not addressed in our case study, because the additional steps make processing more complicated; here we only explain very simply the meaning of high and low overlaps. If the overlapping area of the MFs of two classes in attribute i is low, this implies that attribute i is an informative classification index [Li and Liu, 2012], as any datapoint can easily be classified to the correct class using attribute i. Conversely, if the overlapping area is high, the chances of classifying the datapoint to the correct class are decreased. Should the reader wish to learn more about the issue, the paper by Li and Liu [2012] explains the problems faced and the proposed solutions for constructing attributes in the MTD technique, and also includes results of the modified MTD technique on different classifiers.


Coming back to the MTD technique, after defining the wider range (a, b) and the MF for each class j with respect to attribute i, the virtual sets (synonymous with artificial sets) are constructed by generating more random numbers Bi for each class j using their respective MF as explained above. The number of virtual samples needed may vary for different classification methods and requires further research (generally, 100 artificial samples would suffice to train the NBC). After constructing the virtual set and its corresponding membership values, the membership values of the training set are then computed. Then the virtual sets and training sets, along with their membership values, are used to train the NBC.
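The whole procedure for one attribute of one class can be sketched in Python as follows. The sketch rests on several assumptions: the diffusion functions take the form commonly given in the MTD literature (with the sample variance as the spread term), points equal to μset are counted in neither NL nor NR, the MF is a triangle peaking at μset, and the data values and function names are hypothetical.

```python
import math
import random
from statistics import variance


def mtd_bounds(data):
    """Return (a, mu_set, b) for one attribute of one class."""
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    mu_set = (lo + hi) / 2
    n_l = sum(1 for x in data if x < mu_set)   # points below mu_set
    n_r = sum(1 for x in data if x > mu_set)   # points above mu_set
    skew_l = n_l / (n_l + n_r)
    skew_r = n_r / (n_l + n_r)
    # Assumed spread term: -2 * sample variance * ln(10^-20), which is positive.
    spread = -2 * variance(data) * math.log(1e-20)
    a = mu_set - skew_l * math.sqrt(spread / n_l)
    b = mu_set + skew_r * math.sqrt(spread / n_r)
    # Never let the diffused range be narrower than [min, max].
    return min(a, lo), mu_set, max(b, hi)


def membership(x, a, mu_set, b):
    """Triangular membership value: 1 at mu_set, 0 at and beyond a and b."""
    if x <= a or x >= b:
        return 0.0
    if x <= mu_set:
        return (x - a) / (mu_set - a)
    return (b - x) / (b - mu_set)


def virtual_samples(data, size, rng):
    """Draw uniform virtual values in (a, b) and attach membership values."""
    a, mu_set, b = mtd_bounds(data)
    out = []
    for _ in range(size):
        x = rng.uniform(a, b)
        out.append((x, membership(x, a, mu_set, b)))
    return out


data = [2.0, 3.0, 5.0, 6.0, 9.0]          # hypothetical small sample
a, mu_set, b = mtd_bounds(data)
samples = virtual_samples(data, 100, random.Random(0))
print(a, mu_set, b, samples[:3])
```

Each virtual value is paired with its membership value, so that a classifier trained on the enlarged set can use both, alongside the original training values and their own membership values.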

[Figure 3.3: The membership functions Mi1 and Mi2 of class 1 and class 2, respectively, in attribute i. Axes: attribute i (horizontal), membership value (vertical).]

Chapter 4

The case of building additional Day Care Centres

4.1 Introduction to data

The data used in this study are from the module IB3A70 The Practice of Operational Research and were provided by the County Councils Network (CCN), a Special Interest Group within the Local Government Association (LGA). The CCN is responsible for ensuring that government funds are distributed fairly among the District Authorities to facilitate day-to-day services. It is particularly keen to address districts that are prone to high demand for its services. These services may include transport between homes and day care centres, day care services in patients' homes, and services in the centres themselves.

The CCN wishes to identify districts that are likely to have high demand for day care services, using surveys and studies of each district's population. The idea is to build additional day care centres, where required, in these districts so as to prevent a backlog when a surge in demand occurs, as a backlog would result in even higher costs, overworked staff and an unhappy population. The usual problem is that collecting sufficient data for every district in the country requires a great deal of man-hours and effort. To help tackle this issue and support the decision process, this paper suggests using statistical methods, namely the naive Bayes classifier and the mega-trend diffusion technique, in a decision support system.

The data provide a sample of 72 districts in the UK, out of a total of 352 districts, which we are interested in classifying on the basis of attributes that correspond to the demand for day care services. To date, these 72 districts have been through surveys and customer reviews that led to them being categorised as having low, neutral or high demand for day care services. The CCN hopes to classify the other 281 districts using the limited information it can collect in the limited time allocated, so that immediate action can be taken, as building more day care centres can take months.

This chapter applies what has been covered in the previous chapters on naive Bayesian classifiers, discretisation using the minimum description length principle, and the mega-trend diffuser technique. Section 4.2 goes through the methodology used to implement the NBC, the MDL principle and MTD for the dataset. Section 4.3 presents and discusses the computed results. Lastly, section 4.4 highlights some issues that were encountered during this research.

4.2 Methodology

This paper has thus far proposed the naive Bayesian classifier as a classification model to aid the marketing decision process of companies and governments. We also introduced the MTD technique as a solution to the problem of limited sample sets: it increases the sample size through virtual sets and doubles the number of attributes by including a corresponding membership value for each attribute. The generated sets are in turn used to train the NBC. Additionally, we considered the discretisation of continuous variables using the MDL principle as an intermediate step in improving the NBC. The following steps were taken in our data analysis:

1. Initialisation: (a) Import the data. (b) Discard redundant factors. (c) Load the NBC and MDL packages in R. (d) Define functions for MTD in R.

2. Raw evaluation: (a) Define the training set. (b) Use the training set to train the NBC. (c) Test the NBC on the full dataset and display the results.

3. Implement MDLP and evaluate: (a) Discretise the training set using MDLP. (b) Use the discretised training set to train the NBC. (c) Test the NBC on the full dataset and display the results.

4. Implement MTD and evaluate: (a) Create virtual samples from the training set using the MTD technique. (b) Use the virtual sets and the training set, with their membership values, to train the NBC. (c) Test the NBC on the full dataset and display the results.

5. Implement MTD, then apply MDLP and evaluate: (a) Create virtual samples from the training set using the MTD technique. (b) Discretise both the virtual and training sets using MDLP. (c) Use the discretised virtual sets and training set, with their membership values, to train the NBC. (d) Test the NBC on the full dataset and display the results.

In the following subsections, we describe the above steps in detail.

4.2.1 Initialisation

The first step is to import the dataset into R for building the classifier. In total, the collected data contain entries for 72 districts, each with 5 attributes and a corresponding class (excluding the district names). These 5 attributes are believed by the CCN to be the prominent predictors of demand in a district. We took 20 randomly chosen data points as the training set for the classifier and the remaining 52 data points as the test set for studying the effectiveness of the classifier.
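A random split of this kind can be sketched in R as follows. The data frame below is a randomly generated stand-in with the same shape as the case-study data (72 rows, attributes A1 to A5 and a Demand class), not the CCN data itself, and the name test.dcs.dat is a hypothetical label for the held-out rows.

```r
set.seed(42)  # arbitrary seed so that the split is reproducible

# Stand-in for the real data: five numeric attributes and a class column.
dcs.dat <- data.frame(
  matrix(rnorm(72 * 5), ncol = 5, dimnames = list(NULL, paste0("A", 1:5))),
  Demand = factor(sample(c("High", "Low", "Neutral"), 72, replace = TRUE))
)

train_idx     <- sample(nrow(dcs.dat), 20)  # 20 randomly chosen training rows
train.dcs.dat <- dcs.dat[train_idx, ]
test.dcs.dat  <- dcs.dat[-train_idx, ]      # all remaining rows

c(train = nrow(train.dcs.dat), test = nrow(test.dcs.dat))
```

Fixing the seed is a convenience for the sketch; the text does not state whether the original split was seeded.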

The original data included both important and redundant details of each district. Through rational analysis and discussions with the decision makers [French et al., 1997], the redundant factors were identified and discarded. Table 4.1 lists the attributes that are believed to be the best telltale signs of a district having potentially high, neutral or low demand for day care services.

| Attribute | Description |
|-----------|-------------|
| Demand | Indicates the potential demand of a district (class) |
| A1 | Elderly Population 65+ |
| A2 | Sum of Radial Distance |
| A3 | Sum of Nearest Neighbour distance |
| A4 | Daycare calls per day |
| A5 | Potential Clients Need Index |

Table 4.1: List of the main factors found in the investigation that help identify potential demand for day care services in each district.

An extract of the 20-point training set from the collected data is shown in Table 4.2. The dataset contains 5 continuous numerical attributes, A1, ..., A5. The training set also includes a categorical attribute "Service Demand" (or Demand) that measures the potential demand for day care services in a particular district: High means high potential demand, Low means low potential demand, and Neutral means that neither high nor low demand is expected. Since the CCN is concerned with the potential demand of a district, we define "Demand" as the classification variable and the other attributes as the input or predictor variables.

In this paper, the statistical program R is used for most of the calculation and formulation of the NBC, the MDL principle and MTD. Software packages for the naive Bayesian classifier and the minimum description length principle can be found online at http://cran.r-project.org/web/packages/e1071/index.html (e1071 package) and http://cran.r-project.org/web/packages/discretization/index.html (discretization package), respectively. Since these are readily available and their functions are sufficient, it is unnecessary to code new functions in R for the NBC or the MDL principle. As mentioned in Chapter 2, these packages are loaded using the following command lines in R:
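The command lines themselves are missing from this copy of the text. A minimal sketch of loading the two packages named above, assuming both have already been installed from CRAN, would be:

```r
# One-time installation, if the packages are not yet present:
# install.packages(c("e1071", "discretization"))

library(e1071)           # naive Bayes classifier: naiveBayes()
library(discretization)  # MDLP discretisation: mdlp()
```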

The mega-trend diffuser technique is relatively new and has yet to be widely used for training the NBC with small datasets, particularly for marketing decisions, and no package for it is freely available for R at the moment. Hence, new functions have to be coded in R using the algorithms for the MTD technique (more in subsection 4.2.4).

Now that we have prepared the data, the training sets and the functions required to run the NBC, MTD and the MDL principle in R, we can proceed to build the NBC using the training set. We study the effectiveness of the NBC on the full dataset when it is trained using four different approaches: (1) The raw, unmodified training set is used to train the NBC. (2) Only MDLP is applied to discretise the training set, which is then used to train the NBC. (3) Only the MTD technique is applied to the training set to create artificial sets and new attributes (membership values), which are then used to train the NBC. (4) Both MTD and MDLP are implemented: MTD is first used to create virtual sets and new attributes, then MDLP is applied to the resulting training set, which is then used to train the NBC.

4.2.2 (1) Raw evaluation using NBC

In this step the 20 randomly chosen training data points are used to train the NBC without any manipulation of the data. The training data train.dcs.dat are passed to the function naiveBayes, which computes the conditional posterior probabilities of the class variable (Service Demand) given the predictor variables (A1, ..., A5) using Bayes' rule (2.5). Table 4.2 shows the training set with its three class subsets: high, neutral and low. The function predict can then be called to predict the class of the data points in the training and test data in dcs.dat (which contains 72 entries). The output is a matrix that displays the number of data points predicted by the NBC for each class against their actual class. The experimental results are presented in section 4.3.
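To make the computation concrete without depending on any package, the following base-R sketch fits a naive Bayes classifier from scratch and tabulates predictions against actual classes. It assumes Gaussian class-conditional densities for the numeric attributes, which is also how e1071's naiveBayes treats metric predictors. The function names fit_gnb and predict_gnb are hypothetical, the data are synthetic rather than the CCN data, and the training rows are drawn per class (7, 7 and 6) purely so that every class has enough points for a standard deviation.

```r
# Fit: class priors plus per-class mean and sd of each numeric attribute.
fit_gnb <- function(x, y) {
  y <- factor(y)
  by_class <- lapply(levels(y), function(k) x[y == k, , drop = FALSE])
  list(levels = levels(y),
       prior  = as.numeric(table(y)) / length(y),
       mean   = lapply(by_class, colMeans),
       sd     = lapply(by_class, function(d) apply(d, 2, sd)))
}

# Predict: class with the largest log prior + summed log likelihoods.
predict_gnb <- function(model, x) {
  x <- as.matrix(x)
  scores <- sapply(seq_along(model$levels), function(j) {
    s <- rep(log(model$prior[j]), nrow(x))
    for (i in seq_len(ncol(x)))
      s <- s + dnorm(x[, i], model$mean[[j]][i], model$sd[[j]][i], log = TRUE)
    s
  })
  scores <- matrix(scores, nrow = nrow(x))  # keep matrix shape for one row
  factor(model$levels[max.col(scores, ties.method = "first")],
         levels = model$levels)
}

set.seed(7)  # arbitrary seed
# Synthetic stand-in: two attributes whose means differ by class.
demand <- factor(rep(c("High", "Low", "Neutral"), each = 24))
shift  <- c(High = 5, Low = -5, Neutral = 0)[as.character(demand)]
dcs    <- data.frame(A1 = rnorm(72, shift), A2 = rnorm(72, shift))
train  <- unlist(Map(function(idx, n) sample(idx, n),
                     split(seq_len(72), demand), c(7, 7, 6)),
                 use.names = FALSE)         # 20 training rows in total

model <- fit_gnb(dcs[train, ], demand[train])
pred  <- predict_gnb(model, dcs)
conf  <- table(predicted = pred, actual = demand)  # confusion matrix
conf
sum(diag(conf)) / sum(conf)                        # overall accuracy
```

The diagonal of the table counts correct predictions, so the overall accuracy is the diagonal sum divided by the total number of data points.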

4.2.3 (2) Implementing MDLP only

For this step, the training data are modified using the MDL principle. The idea of the MDL principle is that any regularity found in the data can be used to compress the data [Grünwald, 2005]. MDLP simply states that the best hypothesis is the one with the minimal description length [Kotsiantis and Kanellopoulos, 2006]. One may find the entropy minimisation discretisation (EMD) method treated as synonymous with the minimum description length principle. This is because the MDL principle is applied here as an entropy-based method that uses binary discretisation, in which the cut point is chosen where the entropy is minimal. The method considers a large interval that contains all the known values of an attribute, and binary discretisation is applied recursively to partition this interval into smaller sub-intervals, always selecting the cut point with minimum entropy, until a stopping criterion such as the MDLP is met (or an optimal number of intervals is reached). The MDL measure we use is incorporated in the function mdlp in R. This function is called to discretise the continuous attributes in the training data matrix "train.dcs.dat" using the entropy criterion, with the minimum description length criterion as the stopping rule that halts the discretisation process [Yang and Webb, 2002]. Table 4.3 shows the discretised continuous attributes in the training set. The resulting discretised training dataset is then used for classification learning of the NBC, as before.
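A single step of this recursive procedure can be sketched in base R as follows. This is an illustration of the entropy-based cut and of the MDLP acceptance test in the form given by Fayyad and Irani, not the internals of the mdlp function. The names class_entropy, best_cut and mdlp_accepts are hypothetical, and the recursion over sub-intervals is omitted for brevity.

```r
# Class entropy (in bits) of a vector of labels.
class_entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Best binary cut point for x: the boundary with minimum weighted entropy.
best_cut <- function(x, y) {
  v <- sort(unique(x))
  if (length(v) < 2) return(NULL)
  cuts <- (head(v, -1) + tail(v, -1)) / 2   # midpoints between sorted values
  w_ent <- sapply(cuts, function(t) {
    left <- x <= t
    (sum(left) * class_entropy(y[left]) +
       sum(!left) * class_entropy(y[!left])) / length(x)
  })
  list(cut = cuts[which.min(w_ent)], entropy = min(w_ent))
}

# MDLP stopping rule: accept the cut only if the information gain
# exceeds the cost of encoding the cut.
mdlp_accepts <- function(x, y, cut) {
  n <- length(y)
  left <- x <= cut
  k  <- length(unique(y))                   # classes in the whole interval
  k1 <- length(unique(y[left]))             # classes left of the cut
  k2 <- length(unique(y[!left]))            # classes right of the cut
  e  <- class_entropy(y)
  e1 <- class_entropy(y[left])
  e2 <- class_entropy(y[!left])
  gain  <- e - (sum(left) * e1 + sum(!left) * e2) / n
  delta <- log2(3^k - 2) - (k * e - k1 * e1 - k2 * e2)
  gain > (log2(n - 1) + delta) / n
}

x  <- c(1, 2, 3, 4, 10, 11, 12, 13)
y  <- c("Low", "Low", "Low", "Low", "High", "High", "High", "High")
bc <- best_cut(x, y)
bc$cut                      # midpoint between 4 and 10
mdlp_accepts(x, y, bc$cut)  # is the cut worth keeping?
```

In the full method the same two functions would be applied again to the values on each side of an accepted cut, stopping on each branch as soon as the test rejects the best available cut.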

The availability of the mdlp function is convenient, as we do not need to write a new function to run the discretisation. The drawback of using a readily available software package, however, is limited flexibility. The function does not allow users to dictate the number of bins into which the MDLP algorithm splits the data; this may pose an issue in our study and is highlighted in a later section. A workaround would be to write new functions that fit our needs. However, this may be time-consuming, and we assume that the available mdlp function is sufficient for this paper. Further research may be required to devise new functions that are tailored to specific requirements.

4.2.4 (3) Implementing MTD only

For this step, we implemented the MTD technique described in Chapter 3 on the training data, without discretisation. As in Chapter 3, we require an estimate of the domain range (a, b) of each attribute with respect to each class, so that ns samples can be randomly produced within this range to form the virtual set. However, it is not correct to train the NBC directly with these virtual sets, because they are randomly chosen (from a uniform distribution) and do not represent the complete information of the attribute data distribution [Li et al., 2007a]. Hence, we require a membership function (or diffusion function) Mi, which calculates for each random sample a membership value that reflects the significance of that sample. In addition to the virtual sets, membership values are also calculated for the training data. The detailed steps for calculating the domain range (a, b) and the membership function Mi can be found in Chapter 3 and are not repeated here.

As highlighted above, the MTD technique is relatively new, and no freely available software package for R that runs the MTD algorithm could be found. Hence, we had to code our own functions to implement the algorithm for the MTD technique, as explained in Chapter 3. The main functions coded for use in this paper can be found in Appendix A. These functions are:

> MTD.mem.for.training.dat.w.class(data)

> MTD.final(data, gen.num=1)

Here data is the data.frame to which we wish to apply the MTD technique, which generates membership values for the corresponding attributes, while gen.num is the number of virtual data points ns that will be generated for each class in the data.frame, with their corresponding membership values. The difference between the two functions is that the first function does not generate additional virtual data points, whereas the second function does.

Although further research is required, various published papers and our own experimentation suggest that about 100 virtual samples for each class suffice to train the NBC. Consequently, we generate 100 random artificial data points along with their corresponding membership values. The membership values of the training sets are also calculated, and both the training sets and the virtual sets are then passed to the NBC function for classification learning. Once this is done, we produce a table of the prediction results and compare them with the actual classes, as before. An extract of the created virtual dataset can be seen in Table 4.4.

4.2.5 (4) Implementing MTD then applying MDLP

This step combines steps 3 and 4 by applying the MTD technique to the training set and then using the MDL principle to discretise the data. The resulting modified data are then used for classification learning as in step 2. Note that because the computed membership values MVAi lie in the range [0, 1], we do not apply discretisation to them, as discretising such a small interval would make the membership values redundant. To compute the desired results, we need to discretise only the continuous attributes Ai and leave the membership values MVAi unmodified. For this step, we use the function zipFastener found online [Appendix B] and write a function mdlp.dat.only.
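The idea of discretising the attribute columns while passing the membership columns through untouched can be sketched in base R as follows. This is not the mdlp.dat.only function from the appendix: the name discretise_attributes_only is hypothetical, and the cut points are supplied by hand as an assumption, where in practice they would come from an MDLP routine.

```r
# Discretise the named attribute columns only; every other column
# (here the membership values) is returned unchanged.
# `cut_points` is a named list of cut points, one entry per attribute.
discretise_attributes_only <- function(data, cut_points) {
  for (col in names(cut_points)) {
    # findInterval() counts the cut points at or below each value;
    # adding 1 gives a bin index starting at 1.
    data[[col]] <- findInterval(data[[col]], cut_points[[col]]) + 1L
  }
  data
}

dat <- data.frame(A1 = c(1.2, 3.8, 7.5), MVA1 = c(0.30, 0.95, 0.40),
                  A2 = c(10, 25, 40),    MVA2 = c(0.20, 0.80, 0.50))
out <- discretise_attributes_only(dat, list(A1 = c(2.5, 5), A2 = 20))
out
```

Because the columns are modified in place, each attribute stays next to its membership value and no separate interleaving step is needed in this sketch.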

4.3 Results

In the original dataset, 25 data points belong to class High, 22 to class Low and 25 to class Neutral. The results show a slight improvement in classification performance when the variables are discretised and when the mega-trend diffusion technique is used. The average accuracy of the NBC without any modification of the attributes or additional virtual datasets is 79.45%. The average accuracy increases to 80.96% once discretisation by MDLP is applied, while applying the MTD technique results in an average accuracy of 80.24%. Based on these experimental results, the CCN can use the NBC to classify other districts in the UK and identify areas where it should take immediate action, such as building more day care centres or providing additional staff, before demand escalates and becomes difficult to satisfy. The MDLP technique seems to be useful when the data collected are numerous or complicated, as MDLP compresses the data to a manageable scale for training the NBC. Not only does this shorten the model build time, it also improves the classification accuracy somewhat. Additionally, we have seen that the MTD technique improves the NBC compared with not applying it. The CCN should find the MTD technique a useful tool when making decisions based on limited sample data, in this and other problems.

Kaya [2008] studied popular discretisation techniques in his paper, one of which involves a similar entropy method to MDLP used in this dissertation. Like our paper, Kaya [2008] applied the discretisation techniques on naive Bayes classifier and also found that the model build time and overall classification accuracy showed significant gains.

The paper by Li, Lin, and Huang [2009a] compared the classification accuracy of two learning tools, namely the back-propagation network (BPN) and Bayesian networks (BN), on the original dataset and on virtual sets of different sizes. Their results show an overall improvement in classification performance when these artificial sample sets are used. In their experiment on BPN, moving from the original dataset of 21 data points to virtual sets of size 84 increased the average accuracy from 62% to 90%. In their test on BN, the average accuracy of the classification process rose from 57% to 94%.

Based on the research in this and other related papers, we are inclined to agree that artificial sample sets do contribute to improving the classification performance of the NBC. It is worth noting, however, that these additional data are not the same as real data: they are generated data drawn from the population estimate obtained through the MTD technique.

However, the last three columns of Table 4.5 show that the experimental results for the final method, applying the MTD technique and then the MDLP to the training set, are rather unconvincing. This result and the suspected reasons for it are discussed further in section 4.4.

4.4 Issues encountered

As highlighted briefly in the previous section, the final method of applying the MTD technique and then the MDLP to the training sets returned predictions that were clearly unacceptable. Some class predictions were almost 100% incorrect and others were not far behind. Unfortunately, many days of debugging and testing on different datasets yielded similarly incorrect results. The functions were reviewed and corrected extensively, but we can only attribute the cause of this prediction error to the MDLP function in R, together with some potential unreliability of the MTD function written for this dissertation.

We mentioned previously that the function for implementing the MDL principle is prewritten, which limits flexibility: the user cannot define the number of bins used to compress the data during discretisation. Because the function determines the number of bins itself, it could have over-compressed the data. Over-compressing the dataset can cause a loss of vital information that is required both for the MTD technique and for the classification learning process. This is suspected to be the cause of the inconsistent results.

Furthermore, the MTD function devised for this dissertation was not tested rigorously prior to its use in this study. Although it performed without problems on its own, it may not have been coded to handle the results from the MDLP properly as its input. We therefore suggest that further study and proper evaluation of the functions are required, so that the functions perform properly when the MDLP and MTD methods are applied together.

Chapter 5

Conclusion and further research

In this dissertation we studied how classifiers can be applied practically in the marketing context. Initially, we introduced the concept of a classifier and Bayes' theorem. We then showed how a classifier becomes a naive Bayesian classifier when posterior probabilities are used as the discriminant function, together with the important underlying assumption that the attributes are independent given the class. Although many studies agree that the NBC forms an efficient classification model, we also highlighted its drawbacks and looked into two methods of optimising the NBC suggested by Martinez-Arroyo and Sucar [2006], namely discretisation and structural improvement. For the discretisation process we learnt about the MDL principle, an entropy-based method of compressing a set of continuous data based on any regularity, which we applied to the experimental dataset of this paper. We also looked briefly into methods for structural improvement of the NBC, including the elimination, combination and modification of attributes to address any dependencies.

Sample sets are vital for the classification learning process (or machine learning, for that matter) in the DSS, so that organisations can make rational and correct decisions. Unfortunately, insufficient sample sets are a recurring problem, especially for small or new organisations, usually because gathering sufficient data is costly and time-consuming. This issue was addressed in this dissertation with the proposed use of the mega trend diffusion technique. The MTD technique is a systematic method of acquiring additional "hidden" data-relevant information that is not explicitly provided by the data itself. This paper detailed the steps of establishing the membership function and virtual sets required in the MTD technique.

For this dissertation the naive Bayes classifier, the proposed method of discretisation and the MTD technique were applied to a real-world problem of deciding whether to build more day care centres in districts of the UK. The main contribution of this paper is to suggest integrating the MDLP discretisation method and (or) the MTD technique with the NBC to produce a simple classifier that works well with limited data. The experimental results show that applying discretisation and the MTD technique (independently) to the sample set yields promising classification results compared with training the NBC on an unmodified sample set. In addition to the experimental results obtained in this dissertation, we also featured results from other research papers of a similar nature.

We did, however, encounter unsatisfactory results (as shown in the last three columns of Table 4.5) when we trained the classifier on the virtual set generated by the method that combines the MDLP and the MTD technique. The initial diagnosis suggests that the problem may be due to the function that implements the MDL principle. The readily available MDLP function is not flexible, in the sense that it accepts only limited input variables. We were able to input only the dataset to be discretised and could not define the number of bins into which the algorithm splits the data during discretisation. Consequently, the function itself decides the number of bins, which may have led to over-compression of the data and the loss of relevant information. Additionally, the function for the MTD technique was formulated only for use in this research and has not gone through rigorous testing or debugging. The MTD function is suspected to be incapable of handling the results from the MDLP properly, hence the inconsistency in the results.

This paper has studied the discretisation of attributes for building the NBC, as well as the use of the MTD technique on limited sample sets for NBC learning. Further research is needed, however, on integrating the discretisation and MTD procedures. A replication of this research with improved algorithms and functions for the computation step may lead to better results. Additional research is also needed to study the possibility of using the MTD technique with other learning algorithms, apart from the Bayesian classifier. In the marketing domain, where scarcity of the data required for decision making poses an issue, the MTD technique might just be the answer that marketing managers and decision makers need. Hence the MTD technique itself forms an interesting topic for further study.
