CHAPTER 2

LITERATURE REVIEW

2.1 INTRODUCTION

This chapter provides background information and reviews the characteristics of data warehouses, the OLAP multidimensional database concept, data mining models and the applications of data mining. It also covers the concepts, design and implementation approaches used in this research to develop a complete data warehouse technology framework that integrates an OLAP multidimensional database with a data mining model.

Section 2.2 discusses the fundamentals of data warehouses, data warehouse models, and the Extract, Transform and Load (ETL) process that moves raw data into the data warehouse. It includes a review of existing data warehouse models authored by William Inmon, Ralph Kimball and other scholars working on data warehouse models. Section 2.3 introduces background information on OLAP, including studies of various OLAP models, OLAP architectures, the concepts behind processing multidimensional databases, and the multidimensional database schema designs implemented in this research. Section 2.4 introduces the fundamentals of data mining, including the techniques, methods and processes available for OLAP data mining. Section 2.5 discusses Mitch Kramer's product comparisons for data warehouse, data mining and OLAP tools, and explains why Microsoft products are used to design and implement the new proposed model.

This literature review also introduces the relationships between the data warehouse, the OLAP multidimensional database and the data mining model used to deploy four experimental applications for benchmarking. The review supports the argument that the new proposed data warehouse technology framework can transform raw data of any type into useful information, and it places the new proposed model in the context of existing data warehouse and OLAP multidimensional database frameworks.

2.2 DATA WAREHOUSE

According to William Inmon (1999), known as the "Father of Data Warehousing", a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. A data warehouse is a database containing data that usually represents the business history of an organization. This historical data is used for analysis that supports business decisions at many levels, from strategic planning to the performance evaluation of a discrete organizational unit.

A data warehouse is a type of database system aimed at the effective integration of operational databases into an environment that enables the strategic use of data (G. Zhou et al., 1995). The supporting technologies include relational and multidimensional database management systems, client/server architecture, metadata modelling and repositories, graphical user interfaces and much more (J. Hammer et al., 1995; V. Harinarayan et al., 1996).

Data warehouses are currently a significant subject of research; they are commonly used not only in the business and finance sectors but can also be applied appropriately in many other sectors. Data warehouses are designed for analysing and processing data into useful information, using data mining tools to support critical decision-making. A data warehouse also provides access to complex enterprise data environments.

In this literature study, two authors are identified as the main contributors and co-founders of the data warehouse field: William Inmon (1999; 2005) and Ralph Kimball (1996; 2000). Their perceptions of data warehouse design and architecture differ from one another. According to Inmon (1996), the data warehouse feeds dependent data marts, whereas Kimball (1999) defined the data warehouse as a bus structure, a combination of data marts populated together to form the data warehouse. Table 2.1 summarizes the differences in data warehouse ideology between William Inmon and Ralph Kimball.

Table 2.1 William Inmon and Ralph Kimball Data Warehouse Differences

William Inmon

Ralph Kimball

Paradigm

Inmon's Paradigm: An enterprise has one data warehouse, and data marts source their information from the data warehouse. Information is stored in 3rd normal form.

Kimball's Paradigm: Data warehouse is the collection of heterogeneous data marts within the enterprise. Information is always stored in the dimensional model.

Architecture

Inmon's Architecture: Top-down approach

Kimball's Architecture: Bottom-up approach

Concept

Integration of data from various systems into a centralized repository.

Concept of dimensional modelling (bridging between relational and multidimensional databases).

Design

The design is based on third normal form, with the purpose of preserving data granularity.

Data marts are connected in a bus structure; the union of the data marts forms the data warehouse. This approach is also known as the virtual data warehouse.

ETL Methods

Data are extracted from operational data sources and fed into a staging database area. The data are then transformed, integrated and consolidated, transferred to an Operational Data Store database, and finally loaded into the data marts.

Data are extracted from legacy systems, then consolidated and verified in a staging database. The data are fed into the ODS, where further data are added or updated. The Operational Data Store holds a fresh copy of the data, which is integrated and transformed into the data mart structure.

Data mart

Data Marts are available as a subset of the data warehouse.

Data marts can be placed on different servers or in different geographical locations.

Based on this data warehouse literature, Inmon (2005) and Kimball (2000) hold different philosophies, but they agree that the successful design and implementation of data warehouses and data marts depends mainly on the effective collection of operational data and the validation of the data marts. Both approaches use the same database staging concepts and ETL process for data from a source database. They also share the understanding that independent data marts or data warehouses cannot fulfil enterprise-level end-user requirements for precise, timely and relevant data.

2.2.1 DATA WAREHOUSE ARCHITECTURE

Data warehouse architecture is a wide research area with many sub-areas, and different enterprises approach its analysis, design and implementation differently. The aim of this study is to provide a complete view of data warehouse architecture. The work of two important scholars from TDWI, Thilini (2005) and Eckerson (2003), is discussed in more detail on the topic of data warehouse architecture.

According to Eckerson (2003), before a successful business intelligence system can be implemented, where users work with specialized reporting tools, OLAP tools and data mining tools, the data warehouse architecture model must concentrate on the database staging process that integrates the different OLTP systems, because the ETL process is what makes the whole system workable. Thilini (2005) conducted a two-phase survey study investigating which factors influence the selection of a data warehouse architecture. Thilini's study identifies five data warehouse architectures practised today, as shown in Table 2.2.

Table 2.2 Data Warehouse Architectures (Adapted from Thilini, 2005)

Data Warehouse Architecture

Independent Data Marts

Independent data marts are also known as localized, small-sized data warehouses. They are mainly used by departments, divisions or regions of a company to serve their own operational databases. Because their structures differ from location to location, with inconsistent database designs, it is difficult to analyse data across the data marts. Thilini (2005) cited the work of Winsberg (1996) and Hoss (2002) that "it is common for organizational units to develop their own data marts". Independent data marts are best used as a prototype for an ad hoc data warehouse and for evaluation before building a real data warehouse.

Data Mart Bus Architecture

Kimball (1996) pioneered the design and architecture of the data warehouse as a union of data marts, known as the bus architecture. A bus-architecture data warehouse is derived from the union of its data marts, which is also known as a virtual data warehouse.

The bus architecture allows data marts to be located not only on one server but also across different servers. This allows the data warehouse to function virtually, gathering all the data marts and processing them as one data warehouse.

Hub-and-spoke architecture

Inmon (2005) developed the hub-and-spoke architecture. The hub is the central server that takes care of information exchange, and the spokes handle data transformation for all regional operational data stores. Hub-and-spoke focuses mainly on building a scalable and maintainable infrastructure for the data warehouse.

Centralized Data Warehouse Architecture

The centralized data warehouse architecture is similar to the hub-and-spoke architecture but without the dependent data marts. This architecture copies and stores heterogeneous operational and external data in a single, consistent data warehouse. It has only one data model, which is consistent and complete across all data sources.

According to Inmon (1999) and Kimball (2000), a central data warehouse should have a database staging area, also known as an Operational Data Store, as an intermediate stage for the operational processing and integration of data before it is transformed and loaded into the data warehouse.

Federated Architecture

According to Hackney (2000), a federated data warehouse is an integration of multiple heterogeneous data marts, database staging areas or operational data stores, combined with analytical applications and reporting systems. The federated concept focuses on an integration framework that makes the data warehouse as capable as possible. Jindal (2004) concludes that the federated approach is a practical data warehouse architecture, as it focuses on higher reliability and provides excellent value when business rules are "well defined, documented and integrated".

Thilini (2005) concludes that the hub-and-spoke and centralized data warehouse architectures are similar, and their survey scores are almost the same. The centralized architecture is faster and easier to implement because no data marts are required, and it scored higher than hub-and-spoke where there was an urgent need for a relatively fast implementation.

A data warehouse is a read-only data source in which end users are not allowed to change values or data elements. Inmon's (1999) data warehouse architecture strategy differs from Kimball's (1996). Inmon's model splits off data marts as copies, distributed as an interface between the data warehouse and end users. Kimball views the data warehouse as a union of data marts: the data warehouse is the collection of data marts combined into one central repository. Diagram 2.1 illustrates the differences between Inmon's and Kimball's data warehouse architectures, adapted from Mailvaganam (2007).

Diagram 2.1 Inmon's and Kimball's Data Warehouse Architecture (adapted from Mailvaganam, 2007)

In this work, it is very important to identify a data warehouse architecture that is robust and scalable for building and deploying enterprise-wide systems. According to Laney (2000) and Watson (2005), it is important to understand and select the appropriate data warehouse architecture, as Watson's work on "the success of the various architectures" shows. Their analysis indicates that the most popular data warehouse architecture is the hub-and-spoke proposed by Inmon, a centralized data warehouse with dependent data marts, followed by the data mart bus architecture with dimensional data marts proposed by Kimball. The new proposed model therefore combines the hub-and-spoke and data mart bus architectures: it is designed with a centralized data warehouse and with data marts that are used for multidimensional database modelling.

2.2.2 DATA WAREHOUSE EXTRACT, TRANSFORM, LOADING

A data warehouse architecture begins with the extract, transform, load (ETL) process, which ensures that the data passes a quality threshold. According to Evin (2001), having the right data is critical to the success of an enterprise. ETL is an important tool in the data warehouse environment, ensuring that data arriving in the data warehouse from various systems and locations is cleansed. ETL tools are also responsible for running scheduled tasks that extract data from OLTP systems. Typically, a data warehouse is populated with historical information from within a particular organization (Bunger, C. J. et al., 2001). The complete ETL process is described in Table 2.3.

Table 2.3 Extract, Transform, and Load Process in Data Warehouse architecture

Process

Descriptions

Extract

Extraction is the first process; it moves data from the operational databases into the database staging area or operational data store before the data warehouse is populated. At this stage, operational data is examined as it is extracted into the staging area so that exceptions can be handled and errors fixed before the data enters the data warehouse, which saves considerable time during loading.

Transform

Once extraction into the database staging area is complete, the data is transformed to ensure data integrity within the data warehouse. Transformation can be performed using several methods, such as field mapping and algorithmic comparisons.

Load

After the data has been extracted and transformed, it is finally loaded into the data warehouse (in Inmon's model) or into the data marts (in Kimball's model). The data loaded into the data warehouse is quality data: erroneous records have been removed during extraction, and the data has been transformed to ensure its integrity.

Calvanese, D. et al. (2001) highlight that enterprise data warehouse tables may be populated from a wide variety of data sources at different locations, often including data that provides information about competitor businesses. Collecting all these different data and storing them in one central location is an extremely challenging task that ETL makes possible. The ETL process, as depicted in Diagram 2.2, begins with data being extracted from the operational databases, where data cleansing and scrubbing are carried out to ensure all data are validated. The data is then transformed to meet the data warehouse standards before being loaded into the data warehouse.
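To make the three steps in Table 2.3 concrete, the following SQL sketch illustrates one possible extract, transform and load pass for a single source table. The object names (src_orders, stg_orders, fact_sales and the dim_ tables) are hypothetical and are used only to illustrate the flow described above, not the actual schema of this research.

-- Extract: copy new rows from the operational source into the staging area
INSERT INTO stg_orders (order_id, customer_id, product_id, order_date, amount)
SELECT order_id, customer_id, product_id, order_date, amount
FROM src_orders
WHERE order_date >= '2023-01-01';   -- the current extraction window (assumed)

-- Transform: cleanse and standardize the staged data
UPDATE stg_orders SET amount = 0 WHERE amount IS NULL;   -- handle missing measures
DELETE FROM stg_orders WHERE customer_id IS NULL;        -- drop rows failing integrity checks

-- Load: populate the fact table, assuming the dimension tables are already loaded
INSERT INTO fact_sales (date_key, customer_key, product_key, sales_amount)
SELECT d.date_key, c.customer_key, p.product_key, s.amount
FROM stg_orders s
JOIN dim_date d ON d.full_date = s.order_date
JOIN dim_customer c ON c.customer_id = s.customer_id
JOIN dim_product p ON p.product_id = s.product_id;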

Diagram 2.2 Extract, Transform, Load Process

G. Zhou et al. (1995), emphasizing data integration in data warehousing, stress that ETL can assist in importing and exporting operational data between heterogeneous data sources using an OLE DB (Object Linking and Embedding Database) based architecture, in which the data is transformed so that only quality data populates the data warehouse. With this approach it is important to ensure that there are no restrictions on the size of the data warehouse.

Kimball's (2000) data warehouse architecture model, depicted in Diagram 2.3, focuses on the "back room", the "presentation server" and the "front room". In the back room, data staging services gather data from all the source systems' operational databases, extracting data in different file formats from different systems and platforms. The second step is to run the transformation process, removing all inconsistencies to ensure data integrity. Finally, the data is loaded into the data marts. The ETL processes are commonly executed from a job controller as scheduled tasks. The presentation server is the data warehouse, where the data marts are stored and processed; data is stored in star schemas consisting of dimension and fact tables. The data is then accessed in the front room by query services such as reporting tools, desktop tools, OLAP and data mining tools.

Diagram 2.3 Data Warehouse Architecture (adapted from Kimball, 2000)

Nicola, M. (2000) explains that the process of retrieving data from the warehouse can vary greatly depending on the desired results. There are many possible forms of retrieval from a data warehouse, and it is this flexibility that drives how the retrieval process is implemented. Many tools exist for querying the data warehouse, from simple query and reporting through SQL statements to OLAP and data mining, where the architecture includes many more third-party tools. There are also inherent problems associated with the data, including limited portability and the often vast amount of data that must be sifted through for each query.

Essentially, ETL is mandatory for a data warehouse to ensure data integrity. Complexity and scalability are the two major issues that most enterprises face when integrating information from different sources in order to obtain a clean and reliable source of data for mission-critical business decisions. One way to achieve a scalable, non-complex solution is to adopt a hub-and-spoke architecture for the ETL process. According to Evin (2001), ETL operates best in a hub-and-spoke architecture because of its flexibility and efficiency; its centralized data warehouse design helps maintain full access control over the ETL processes and empowers knowledge workers to use analytical and data mining tools.

In this study of ETL for an effective data warehouse architecture, the hub-and-spoke architecture is found to be best for data integration, as it shares similarities with both the Inmon and Kimball architectures. The hub is the data warehouse, populated after data has been processed from the operational databases through the staging database, and the spokes are the data marts that distribute the data. Inmon and Kimball also recommend the same ETL processes to enable the hub-and-spoke architecture. Sherman (2005) states that the hub-and-spoke approach uses one-to-many interfaces from the data warehouse to many data marts. One-to-many interfaces are simpler to implement, cost-effective in the long run and ensure consistent dimensions, whereas the many-to-many approach is more complicated and costly.

In this work on the new proposed model, the hub-and-spoke architecture is used as the "central repository service", since many scholars, including Inmon, Kimball, Evin, Sherman and Nicola, adopt this data warehouse architecture. This approach allows the hub (data warehouse) and spokes (data marts) to be located centrally or distributed across a local or wide area network, depending on business requirements. In designing the new proposed model, the hub-and-spoke architecture identifies six important components that a data warehouse should have: ETL; a staging database or operational data store; data marts; a multidimensional database; OLAP; and data mining end-user applications such as data query, reporting, analysis and statistical tools. However, the process may differ from organization to organization: depending on the ETL setup, some data warehouses overwrite old data with new data, while others maintain a history and audit trail of all changes to the data. Diagram 2.4 depicts the concept of the new proposed model's data warehouse architecture.

Diagram 2.4 New Proposed Model Data Warehouse Architecture

2.2.3 DATA WAREHOUSE FAILURE AND SUCCESS FACTORS

Building a data warehouse is challenging, as a data warehouse project has unique characteristics that can affect the overall setup if the analysis, design and implementation phases are not carried out properly. This section discusses studies on the failure and success factors of data warehouse projects. Section 2.2.3.1 focuses on investigating data warehouse project failures, and section 2.2.3.2 investigates the success factors and the selection of the correct model to support a successful data warehouse project implementation.

2.2.3.1 DATA WAREHOUSE FAILURE FACTORS

Hayen's (2007) studies show that implementing a data warehouse project is costly and risky; a data warehouse project can cost over $1 million in its first year. It is estimated that one-half to two-thirds of data warehouse project attempts will eventually fail. Hayen (2007), citing the work of Briggs (2002), identifies three groups of factors behind the failure of data warehouse projects: environment, project and technical factors, as shown in Diagram 2.5 and discussed in more detail in Table 2.4.

Diagram 2.5 Factors for Data Warehouse Failures (adapted from Briggs, 2002)

Table 2.4 Factors for Data Warehouse Failures (adapted from Briggs, 2002)

Factors

Descriptions

Environment

Environmental factors include organizational changes in business, politics, mergers and takeovers, and a lack of top management support. They also include human error, corporate culture, decision-making and change management.

Technical

Technical factors arise when the complexity and workload of a data warehouse project are taken too lightly, leading to high expenses in hardware, software and people. Problems occur when a project manager who lacks knowledge of and experience in data warehouse costing is assigned, which can impede quantifying the return on investment (ROI). Failures in managing a data warehouse project also include:

· Difficulty in setting up a competent operational and development team, and not having a data warehouse manager or expert who is politically astute.

· An extended timeframe for the development and delivery of the data warehouse system, which may be due to a lack of experience and knowledge in selecting data warehouse products and end-user tools.

· Failure to manage the scope of the data warehouse project.

Project

Project factors include poor knowledge of the data definition and data quality requirements of the organization's different business departments. In addition, running a data warehouse project with insufficient knowledge of which technology to use may lead to later problems with data integration, the data warehouse model and the data warehouse applications.

Vassiliadis's (2004) studies show that data warehouse project failure is an enormous threat, driven by design, technical, procedural and socio-technical factors, as illustrated in Diagram 2.6. Recognizing these failure factors is vital for avoiding actions that undermine success. Each factor group is described in Table 2.5.

Diagram 2.6 Factors for Data Warehouse Failures (adapted from Vassiliadis, 2007)

Table 2.5 Factors for Data Warehouse Failures (adapted from Vassiliadis, 2007)

Factors

Descriptions

Design

Design factors arise because a data warehouse project may proceed with no standard techniques or design methodologies. During the analysis and design phases, a project may adopt ad hoc ideas about metadata techniques or languages and data engineering techniques. Proprietary solutions and recommendations from vendors or in-house experts may also end up defining the design of the data warehouse blueprint.

Technical

Technical factors relate to a lack of know-how and experience in evaluating and choosing the hardware setup for data warehouse systems.

Procedural

Procedural factors concern imperfections in data warehouse deployment. This factor focuses on extensively training end users on the new technology and on the design of the data warehouse, which is completely different from conventional IT solutions. The user community plays a vital and crucial role in this factor.

Socio-Technical

Socio-technical factors arise when the data warehouse project conflicts with the organization's modus operandi: the data warehouse system may lead to restructuring or reorganization of the way the organization operates by introducing changes to the user community.

According to Vassiliadis (2007), another potential factor in the failure of data warehouse projects is "data ownership and access". This is a sensitive issue within an organization: sharing or acquiring someone else's data can be perceived as losing authority over data ownership and access. Likewise, departmental declarations or requests for total ownership of clean, error-free data should be restricted, as they may cause problems over data ownership rights.

Watson (2004) stresses that the general factors behind data warehouse project failure comprise weak sponsorship and top management support, inadequate funding, inadequate user participation, and organizational politics.

2.2.3.2 DATA WAREHOUSE SUCCESS FACTORS

A data warehouse implementation can end in disaster if the failure factors discussed in section 2.2.3.1, based on the studies of Briggs (2002) and Vassiliadis (2004), are not taken into serious consideration. According to Hwang (2007), data warehouse implementation is an important area of research and industrial practice, but only a few studies have assessed the critical success factors for data warehouse implementations. There are certainly procedures for data warehouse design and implementation, but only certain guidelines have been subjected to experimental testing. It is therefore best to decide on and choose the proper data warehouse model for implementation success.

To identify and fill the gaps in the analysis of data warehouse success factors, a number of success factors are gathered from data warehouse scholars and professionals (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003), each of whom experimentally validated different characteristics of data warehouse success. This body of work is beneficial for planning and implementing data warehouse projects and guides the successful design and implementation of the new proposed model in this research.

There are several success factors in designing and implementing data warehouse solutions, and the most important depend on the selection of the data warehouse model, since different organizations may have different scopes and road maps for data warehouse development. The results of building a successful data warehouse are then used to quantify the factors involved and to prioritize those that benefit continued research into improving and enhancing data warehouse model success.

According to Hayen (2007), a data warehouse is a complex system that can complicate business procedures, and this complexity can prevent companies from making necessary changes to data or transactions. It is therefore important to analyse which data warehouse model should be used for such complex, mission-critical systems. Hwang (2007) studied data warehousing models and success factors as a critical area of practice and research, but found that only a few studies have measured data warehouse projects against success factors.

Many scholars have conducted in-depth research in the area of data warehousing, with outcomes that succeeded or failed for a variety of reasons. It is therefore useful to inspect case studies of selected companies' data warehouse implementations and to examine the failure and success factors through surveys (Winter, 2001; Watson et al., 2004).

Hwang (2007) conducted a survey study of six data warehouse research groups (Watson & Haley, 1997; Chen et al., 2000; Wixom & Watson, 2001; Watson et al., 2001; Hwang & Cappel, 2002; Shin, 2003) on the success factors of data warehouse projects. Each study measures different success factors for a project. Table 2.6 summarizes the six studies: Watson and Haley (1997) measure data warehouse success factors; Chen et al. (2000), Watson et al. (2001) and Shin (2003) measure data warehouse implementation factors; and Hwang and Cappel (2002) measure development and management practices. Only Wixom and Watson (2001), as shown in Diagram 2.7, measure both data warehouse implementation and success factors, so their study can be used as a model for a successful data warehouse implementation. The review of all six studies shows that, without both implementation and success factors, the effect of any factor on data warehouse success cannot be validated.

Table 2.6 Factors for Data Warehouse Success (adapted from Hwang M.I., 2007)

Study: Watson & Haley (1997)
Data warehouse success factors: Focus on user involvement and support, with clear and understandable business needs; use of methodology and modelling methods in the data warehouse, targeting "clean data"; support from upper management contributes to the success.
Data warehouse implementation factors: N/A
Results reported: Ordered list of success measures

Study: Chen et al. (2000)
Data warehouse success factors: N/A
Data warehouse implementation factors: Focus on the exactness and preciseness of user satisfaction through support for and realization of end users' needs.
Results reported: Support for end users affects user satisfaction

Study: Wixom & Watson (2001)
Data warehouse success factors: Data and system quality contribute to perceived data warehouse success.
Data warehouse implementation factors: Management support, resources, user participation, team skills, source systems and development technology contribute to implementation success from organizational, project and technical perspectives.
Results reported: Some success factors affect data warehouse success

Study: Watson et al. (2001)
Data warehouse success factors: N/A
Data warehouse implementation factors: Having users, rather than developers, produce more and better requirements information about the system; this improves the decision approach and the business process, and supports the accomplishment of strategic business objectives.
Results reported: Ordered list of success measures

Study: Hwang & Cappel (2002)
Data warehouse success factors: N/A
Data warehouse implementation factors: N/A
Results reported: Development/management practices

Study: Shin (2003)
Data warehouse success factors: N/A
Data warehouse implementation factors: System, information and service quality ensure user satisfaction.
Results reported: System quality affects user satisfaction

Diagram 2.7 Wixom Data Warehouse Research Model (adapted from Wixom, 2001)

In this study of data warehouse failure and success factors for developing the new proposed model, Hwang's (2007) research model, depicted in Diagram 2.8, is applied. Hwang selected eight implementation factors that directly affect six selected success variables, combining the lists from the research studies above, as shown in Diagram 2.8.

Diagram 2.8 Hwang M. I. Data Warehouse Research Model (adapted from Hwang M.I. 2007)

The research model serves as an important guideline for implementing successful data warehouse projects. There are certainly many guidelines from other researchers, but this study follows Hwang's (2007) research model because it "facilitate research integration and variable selection for current use and future research. The model is general and new factors or success variables, when identified, can be added easily". This is the major advantage of using this model throughout the research on building effective data warehouse systems.

2.3 ONLINE ANALYTICAL PROCESSING

The OLAP Council (1997) defines OLAP as a category of decision support software technology that enables business analysts, managers and executives to gain fast, consistent and interactive access to information that has been reformulated, transformed and summarized from relational data, mainly from the data warehouse, into multidimensional views, allowing optimal data retrieval and trend analysis.

Essentially, OLAP creates a "hypercube" or "cube" of information, allowing more complex analysis than a traditional relational database. Typical OLAP analysis techniques are "slice and dice" and "drill down", which allow analysts to segment the data into whatever views of information they choose.

According to Nigel (2008), OLAP also implies "digging through tons of data" to uncover patterns and relationships contained within business activity and history. OLAP permits the analysis of database information from multiple database source systems at one time. Processing database information using OLAP requires an OLAP server to organize, transform and build the multidimensional database. The multidimensional database is then divided into cubes, which client OLAP tools use to perform data analysis aimed at discovering new patterns and relationships between the cubes. Popular OLAP server software is available from vendors such as Oracle, IBM and Microsoft.

According to Chaudhuri (1997), Burdick (2006) and Vassiliadis (1999), OLAP is emerging as a powerful approach to strategic database analysis. OLAP has the ability to analyse large amounts of data to extract valuable information, and the analysis can be applied in the business, education or medical sectors. The technologies of the data warehouse, OLAP and analysis tools support that ability.

Madeira (2003) supports the view that OLAP and data warehouses are complementary: a data warehouse stores and manages data, while OLAP transforms data warehouse data into strategic information. OLAP ranges from basic navigation and browsing (often known as "slice and dice"), to calculations, to more serious analyses such as time series and complex modelling. As decision-makers exercise more advanced OLAP capabilities, they move from data access to information and on to knowledge.

2.3.1 OLTP vs. OLAP

The design of a data warehouse and its online analytical processing (OLAP) multidimensional cubes is fundamentally different from that of an online transaction processing (OLTP) database. The data warehouse is specifically designed for fast query times and multidimensional analysis, whereas OLTP systems are information systems that support the daily processing of an organization. The main function of an OLTP system is to capture information and get data into computers. OLTP allows effective data manipulation and storage of information for daily operations, resulting in huge quantities of transactional data. Relational databases are designed for OLTP systems, which maximize efficiency by operating on a few records concurrently in predictable patterns. Organizations build multiple OLTP systems to handle different aspects of daily operations, and huge quantities of transactional data can accumulate in a short period of time.

OLAP, or Online Analytical Processing, is a class of software technologies designed for data access and analysis. OLAP is an enabling application technology and database model that gives managerial users strategic decision-making capabilities. The OLAP database model is known as multidimensional database technology. OLAP addresses the analysis problem by aggregating data into a multidimensional view that OLAP tools can access directly without hindering system performance.

It is complicated to merge OLAP and OLTP into one centralized system. The dimensional data model used in OLAP is much more effective for querying than the relational model used in an OLTP system. OLAP may use one central database as its data source, while OLTP uses different data from several sites. The dimensional design of OLAP is not suitable for an OLTP system, mainly due to redundancy and the loss of referential integrity of the data. Organizations therefore choose to have two separate information systems: one OLTP system and one OLAP system.

According to Poe (1983), OLAP systems are mainly used to evaluate and analyse patterns and trends in the data held in a centralized data warehouse database. In an OLTP environment, by contrast, it is difficult to perform such pattern analysis because the information is stored in diverse systems across several departments of the organization.

One might argue that the purpose of OLTP systems is to get data into computers, whereas the purpose of OLAP is to get data or information out of computers. According to Han (2001), OLTP systems are "customer oriented" while OLAP systems are "market oriented". Table 2.7 summarizes the major differences between OLTP and OLAP systems, adapted from Han (2001).

Table 2.7 Differences between OLTP and OLAP (adapted from Han, 2001)

Feature | OLTP | OLAP
Characteristic | Operational processing | Informational processing
Orientation | Transaction | Analysis
User | Clerk, database administrator (DBA), database professional | Knowledge worker (e.g. manager, executive, analyst)
Function | Day-to-day operations | Long-term informational requirements, decision support
Database design | ER-based, application oriented | Star/snowflake, subject oriented
Data | Current, guaranteed up to date | Historical, accuracy maintained over time
Summarization | Primitive, highly detailed | Summarized, consolidated
View | Detailed, flat relational | Summarized, multidimensional
Unit of work | Simple transaction | Complex query
Access | Read/write | Read
Focus | Data in | Information out
Operations | Index/hash on primary key | Lots of scans
Number of records accessed | Tens | Millions
Number of users | Thousands | Hundreds
Database size | 100 MB to GB | 100 GB to TB
Priority | High performance, high availability | High flexibility, end-user autonomy
Metric | Transaction throughput | Query throughput

2.3.2 TWELVE RULES OF OLAP

The term OLAP was coined and first used by E. F. Codd, the father of the relational database. In 1985, Codd wrote a paper defining rules for Relational Database Management Systems (RDBMS), which modernized the database industry. Later, in his 1993 paper (Codd, 1993), Codd not only introduced the term OLAP but also provided a set of criteria, known as the "Twelve Rules", for assessing the capability of OLAP products, including data manipulation, unlimited dimensions and aggregation levels, and flexible reporting. OLAP is software and data processing that allows the consolidation and analysis of data in a multidimensional space. E. F. Codd's "Twelve Rules" for OLAP systems are shown in Table 2.8:

Table 2.8 E.F. Codd “Twelve Rules” for OLAP Systems (adapted from Codd, 1993)

Rules

Descriptions

Multi-dimensional Conceptual View

Data should be presented to the user in a multidimensional form that supports pivoting, rotating and "slicing and dicing". The multidimensional representation model is oriented more towards data manipulation than towards simply querying data.

Transparency

The complexity and process orchestration of the OLAP tools should be hidden from users. A typical user is only interested in the graphical user interface of the tools and should not have to complicate the existing productive OLAP environment. The system should be transparent, so that users can work without knowing that they are accessing an OLAP database.

Accessibility

The OLAP tool should perform all process orchestration, including mappings and translations, without any intervention from users. Users interact only with the relevant data needed for analysis, in a consistent way.

Consistent Reporting Performance

The size of the data being analysed should not affect reporting performance, regardless of the number of dimensions used. Reliable reporting performance ensures that users can use the system freely rather than working around problems in the system.

Client-Server Architecture

OLAP systems should be deployed in a client/server architecture, so that clients use the OLAP tools to manipulate data held on the OLAP server. It is essential that the OLAP server component can combine data sources effectively and allow multiple attached clients.

Generic Dimensionality

All dimensions are equivalent and should be functionally and operationally the same. No bias or preference should be given to operations on any particular dimension.

Dynamic Sparse Matrix Handling

The OLAP tool must adapt its physical schema to the data distribution; for example, null values should be stored efficiently without causing problems for the user. The OLAP tool should be able to accommodate large and sparse data sets without any undesirable impact on performance.

Multi-User Support

OLAP tools should support a multi-user environment with concurrent access, allowing each user to view information in their own preferred way.

Unrestricted Cross-Dimensional Operations

The OLAP tool should handle all associated rules, such as hierarchies, relationships and dependencies across all dimensions, and provide these features for users to navigate without requiring users to define any of them.

Intuitive Data Manipulation

OLAP tools should provide an "all-in-one" view of the data in which everything required is present, without additional menus and command buttons that complicate the interface for users.

Flexible Reporting

Users should be able to present reports in any way they choose.

Unlimited Dimensions and Aggregation Levels

OLAP should not impose any limits on the number of dimensions and levels defined in a model.

In this study, Codd's twelve rules of OLAP are applied when designing and implementing the new proposed model's OLAP system. Table 2.9 explains how the new proposed system satisfies Codd's twelve rules for building an OLAP system.

Table 2.9 New Proposed Model Fulfilled E.F. Codd Twelve Rules for OLAP Systems

Rule

Satisfies (Yes/No)

Explanation

Multi-dimensional Conceptual View

Yes

The system can present data to users as multidimensional database cubes.

Transparency

Yes

Users do not need to know that they are using an OLAP multidimensional database; they simply select from the available cubes.

Accessibility

Yes

The OLAP system chooses the best data sources for tools such as pivoting, MDX queries and drill-down.

Consistent Reporting Performance

Yes

System performance is the same regardless of the number of dimensions in use.

Client-Server Architecture

Yes

The system is deployed in a client/server architecture, separating the OLAP server from the client tools.

Generic Dimensionality

Yes

All dimensions are treated equally when the system browses the cube data.

Dynamic Sparse Matrix Handling

Yes

The system handles all null values internally without affecting users.

Multi-User Support

Yes

The system allows multiple users to access the defined cubes concurrently.

Unrestricted Cross-Dimensional Operations

Yes

The system allows data to be rolled up through the levels of a hierarchy.

Intuitive Data Manipulation

Yes

The user's view of the data contains everything required, with no additional menus needed.

Flexible Reporting

Yes

Users can present the data in any way they prefer; four types of tools are implemented in this study.

Unlimited Dimensions and Aggregation Levels

Yes

There is no limit to the number of dimensions and levels in the system.

Apart from Codd's twelve rules, Berson (2001) stressed that a high-quality OLAP system should also provide complete database management tools as an integrated, centralized utility that allows databases to be distributed within the enterprise. In addition, OLAP's ability to drill within the multidimensional database should extend right down to the source, the detail record level; that is, the OLAP tool should permit a smooth changeover from the multidimensional database to the detail record level of the source relational database. OLAP systems must also support incremental database refreshes. This is an important feature for preventing operational stability issues and usability problems as the size of the database increases.

2.3.3 OLAP FUNCTIONALITY

According to the OLAP Council (1997), OLAP functionality offers dynamic multidimensional analysis that supports end users in analytical activities, including calculations and modelling applied across dimensions, trend analysis over chronological time periods, slicing subsets for on-screen viewing, drill-down/drill-up to deeper levels of records as depicted in Diagram 2.9, reach-through to underlying detail data, and rotation to new dimensional comparisons in the viewing area. OLAP is implemented in a multi-user client/server environment and provides consistently fast responses to queries, regardless of database size and complexity. OLAP helps end users integrate enterprise information through comparative, customized viewing and analysis of historical and current data in various "what-if" scenarios. This is achieved through the use of an OLAP server, as depicted in Diagram 2.10.

OLAP functionality is provided by an OLAP server. The OLAP Council (1997) defined an OLAP server as a "high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures". The OLAP server's design and data structures are optimized for fast information retrieval in any direction and for flexible calculation and transformation of raw data. The OLAP server may physically store the processed multidimensional information in order to deliver consistently fast response times to end users, it may populate its data structures in real time from relational databases, or it may offer a choice of both.
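As a simple illustration of drill-down and drill-up (Diagram 2.9), the two queries below aggregate the same hypothetical fact table at a coarse and at a finer level of a time hierarchy; the table and column names (fact_sales, dim_date, sales_amount) are assumptions made for this sketch only.

-- Roll-up (drill-up): total sales per year
SELECT d.year, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.year;

-- Drill-down: the same measure at the finer quarter and month levels
SELECT d.year, d.quarter, d.month, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.year, d.quarter, d.month;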

Diagram 2.9 OLAP Drill Down/Drill up Concept

Diagram 2.10 OLAP Server (adapted from Berson, 2001)

2.3.4 OLAP ARCHITECTURE

According to Nigel (2006), it is important to differentiate the capabilities of a data warehouse from those of an OLAP system. In contrast to a data warehouse, which is usually based on relational technology, OLAP uses a multidimensional view of aggregated data to provide rapid access to strategic information for further analysis. OLAP architectures are divided into three main types based on the way they store multidimensional data and perform operations on that data. The major categories are multidimensional OLAP (MOLAP), relational OLAP (ROLAP) and hybrid OLAP (HOLAP), as described below (Berson, A., 2001; Burdick, D. et al., 2000; Vassiliadis, P., 1999; Pedersen, T. B., 2001):

• MOLAP

In MOLAP, as depicted in Diagram 2.11, data is stored and summarized in a multidimensional cube. MOLAP is the fastest of the three architectures. The data is stored not in a relational database but in proprietary multidimensional formats. This is the more conventional form of OLAP analysis compared to ROLAP.

Diagram 2.11 MOLAP Architecture (adapted from Microsoft, 1999)

MOLAP cubes offer excellent performance because they are built and designed for rapid data retrieval, and they are most advantageous for slicing and dicing operations. MOLAP can handle complex calculations because all calculations are pre-generated when the cube is created. However, MOLAP is restricted in the amount of data it can manage: because the calculations are executed when the cube is constructed, very large volumes of data cannot be included in the cube itself. MOLAP cube technology is also proprietary and requires additional investment in human and capital resources.

• ROLAP

In ROLAP, as depicted in Diagram 2.12, data and aggregations are stored in relational database tables to provide the OLAP slicing and dicing functionality. ROLAP is the slowest of the OLAP flavours. It relies on manipulating the data directly in the relational database to give the appearance of conventional OLAP slicing and dicing functionality; fundamentally, each slicing and dicing action is equivalent to adding a WHERE clause to the SQL statement.

Diagram 2.12 ROLAP Architecture (adapted from Microsoft, 1999)

ROLAP can manage large amounts of data, has no inherent limitation on data size, and can leverage the intrinsic functionality of the relational database. However, ROLAP performance is slow because each ROLAP operation is essentially one or more SQL queries against the relational database. Query time and the number of SQL statements executed grow with the complexity of those statements and can become a bottleneck if the underlying data volume is large. Because ROLAP depends essentially on generating SQL statements to query the relational database, and SQL does not cater for all analytical needs, ROLAP technology is conventionally limited by what SQL functionality can offer.
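To show how a ROLAP slice or dice maps onto a WHERE clause, the sketch below restricts the same hypothetical star schema used earlier (fact_sales, dim_date and dim_product are assumed names) to one quarter and one product category; each additional slice or dice simply adds another predicate or grouping column.

-- Slice on the time dimension and dice on the product dimension
SELECT p.category, d.month, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
WHERE d.year = 2023
  AND d.quarter = 1               -- slice: one quarter only
  AND p.category = 'Antibiotics'  -- dice: one product category only
GROUP BY p.category, d.month;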

• HOLAP

HOLAP, as depicted in Diagram 2.13, combines the technologies of MOLAP and ROLAP. Detail data is stored in ROLAP relational database tables while the aggregations are stored in a MOLAP cube. HOLAP can "drill down" from the multidimensional cube into the underlying relational data. For summary-type information, HOLAP leverages cube technology for faster performance; when detailed information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Diagram 2.13 HOLAP Architecture (adapted from Microsoft, 1999)

According to Pedersen (2001), OLAP is a technology that allows users to examine business data and become familiar with the information it contains. The OLAP data model is multidimensional, so it is called a cube rather than a table. OLAP implies "digging through tons of data" to uncover patterns and relationships contained within business activity and history.

Several studies (Burdick et al., 2000; Vassiliadis, P., 1999; Harinarayan, 1996) have presented and discussed the importance of MOLAP specifications, which are based on the degree of normalization in the database tables. MOLAP partitions store the aggregations and a copy of the source data (fact and dimension data) in a multidimensional structure on the OLAP server, and all partitions are stored on the OLAP server.

Harinarayan (1996) highlights that MOLAP requires the data to be stored in a multidimensional format, involving the creation of multidimensional blocks called data cubes. A cube in the MOLAP architecture may have three or more axes (dimensions), each representing a logical category of data. One axis may, for example, represent the geographic location of the data, while others may represent a period of time or a specific school. Each of these categories, described in the following section, can be broken down into successive levels, and it is possible to drill up or down between the levels.

Cabibo (1997) states that MOLAP partitions are normally stored on the OLAP server, with the relational database frequently stored on a separate server. The OLAP server must then query across the network whenever it needs to access the relational tables to resolve a query, and the impact of this depends on the performance characteristics of the network itself. Even when the relational database is placed on the same server as the OLAP server, inter-process calls and the associated context switching are required to retrieve relational data. With a MOLAP partition, calls to the relational database, whether local or over the network, do not occur during querying.

2.3.5 MULTIDIMENSIONAL DATABASE SCHEMA

According to Kimball (1999), the basis of every data warehousing system is a relational database built using a dimensional model. A dimensional model consists of fact and dimension tables and is typically described as a star or snowflake schema. A schema is a collection of database objects: tables, views and indexes. To aid understanding of dimensional data modelling, Table 2.10 defines some of the terms commonly used in this type of modelling:

Table 2.10 Multidimensional Database Schema Terms

Terms

Descriptions

Fact Table

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated.

Dimension Table

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value; they are normally descriptive, textual values. Dimension tables are generally small in size compared to the fact table.

Attribute

A unique level within a dimension. For example, Month is an attribute in the Time dimension.

Hierarchy

The specification of levels representing the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

In designing data models for data warehouses, the most commonly used schema types are the star schema and the snowflake schema. In the star schema design, the fact table sits in the middle and is connected to the surrounding dimension tables like a star. A star schema can be simple or complex: a simple star consists of one fact table, while a complex star can have more than one fact table.

Most data warehouses use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each tuple in the fact table contains a pointer (foreign key) to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. Each dimension table consists of columns that correspond to attributes of the dimension. Diagram 2.14 shows an example of a star schema for a medical informatics system.

Diagram 2.14: Star Schema
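As a sketch of how such a star schema could be declared in SQL, the statements below create one fact table and two of its dimension tables. The names (fact_treatment, dim_date, dim_patient) are hypothetical and only mirror the structure described above; they are not the actual schema of the medical informatics system in Diagram 2.14.

-- Dimension tables hold descriptive, mostly textual attributes
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date DATE,
    month     INTEGER,
    quarter   INTEGER,
    year      INTEGER
);

CREATE TABLE dim_patient (
    patient_key INTEGER PRIMARY KEY,
    patient_id  VARCHAR(20),
    gender      VARCHAR(10),
    age_group   VARCHAR(20)
);

-- The fact table stores numeric measures plus a foreign key to each dimension
CREATE TABLE fact_treatment (
    date_key       INTEGER REFERENCES dim_date (date_key),
    patient_key    INTEGER REFERENCES dim_patient (patient_key),
    treatment_cost DECIMAL(10, 2),
    length_of_stay INTEGER
);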

Star schemas do not explicitly provide support for attribute hierarchies. The implementation of the applications in this research in the MOLAP environment requires many dimension hierarchies for efficient drilling through the data; the answer to this problem is to implement both star and snowflake schemas.

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables. The main disadvantage is the additional maintenance effort needed due to the increased number of lookup tables.

Snowflake schemas provide a refinement of star schemas in which the dimensional hierarchy is explicitly represented by normalizing the dimension tables, as shown in Diagram 2.15. This leads to advantages in maintaining the dimension tables. However, the denormalized structure of the dimension tables in star schemas may be more appropriate for browsing the dimensions.

Diagram 2.15: Snowflake Schema
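A minimal sketch of how one dimension could be snowflaked, under the same assumed naming as the earlier examples: the product dimension is normalized by moving its category attributes into a separate lookup table instead of repeating them in every product row.

-- Normalized category lookup table (the "snowflaked" point of the star)
CREATE TABLE dim_product_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50),
    department    VARCHAR(50)
);

-- The product dimension now references the category table
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES dim_product_category (category_key)
);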

The multidimensional data model is implemented directly by MOLAP servers; when a relational (ROLAP) server is used instead, the multidimensional model and its operations have to be mapped onto relations and SQL queries. This motivates the design of database schemas that reflect the multidimensional views of the data, whereas entity-relationship diagrams and normalization techniques are popularly used for database design in OLTP environments.

Levene, M. (2003) stresses that, in addition to the fact and dimension tables, data warehouses store selected summary tables containing pre-aggregated data. In the simplest cases, the pre-aggregated data corresponds to aggregating the fact table on one or more selected dimensions. Such pre-aggregated summary data can be represented in the database in at least two ways. Whether to use a star or a snowflake schema depends mainly on business needs. In this research the snowflake schema is preferred, because when information must be analysed at a very detailed level, the snowflake schema can be used to drill through to it.

2.3.6 SUMMARY

In this work, MOLAP is used to create the multidimensional databases, as the MOLAP architecture can store data, create multidimensional mapping files and calculate aggregations faster than a relational engine can create indexes and calculate aggregations. The primary reason the relational engine is slower is the large table joins it must perform while processing a ROLAP partition. In addition, because the relational engine performs the actual processing tasks, competing demands for resources on the computer hosting the relational tables can negatively affect the processing performance of a ROLAP partition.

There is no significant difference in processing performance between MOLAP partitions and HOLAP partitions. In both cases, all fact-level data is read from the relational database and aggregations are calculated; with MOLAP the fact-level data is written into the MOLAP structure, whereas with HOLAP the fact-level data is not stored. This difference has minimal impact on processing performance but can have a significant impact on query performance. Because HOLAP and MOLAP processing speeds are approximately the same and MOLAP query performance is superior, MOLAP is the optimum storage choice. The OLAP performance assessment experiment in chapter 4.3 examines these OLAP architectures further.

If a real-time ROLAP partition is deployed, the ROLAP partition should be made as small as possible and MOLAP should be used for all other partitions in the cube. This configuration maximizes query responsiveness and minimizes processing time. Due to shrinking hardware and processing costs, MOLAP is generally the most often used mode. HOLAP is a better solution when the cube accesses a stand-alone database, and ROLAP is more convenient to set up when query demands are relatively low, also against a stand-alone database.
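The storage guidance above can be read as a simple decision rule. The sketch below merely encodes that rule for illustration; the parameters and thresholds are hypothetical, and this is not an Analysis Services API:

```python
def choose_storage_mode(is_real_time: bool, query_demand: str, stand_alone_db: bool) -> str:
    """Illustrative encoding of the guideline: keep real-time data in a small ROLAP
    partition, reserve HOLAP/ROLAP for stand-alone or low-demand scenarios,
    and default everything else to MOLAP."""
    if is_real_time:
        return "ROLAP (keep this partition as small as possible)"
    if stand_alone_db and query_demand == "low":
        return "ROLAP"
    if stand_alone_db:
        return "HOLAP"
    return "MOLAP"

print(choose_storage_mode(is_real_time=False, query_demand="high", stand_alone_db=False))
```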

2.4 DATA MINING

According to Fayyad (1998), data mining is about discovering hidden patterns from historical datasets and applying those patterns for prediction. Data mining involves searching for specific, unknown information within databases, and it is particularly proficient at discovering hidden trends in data by using pattern recognition in addition to statistical and mathematical techniques. Chung, H. (1999) referred to data mining as knowledge discovery in databases: the process of analysing data from different perspectives and turning it into useful information.

Freitas (2002) and Fayyad (1996) have recognized data mining as an advantageous tool for extracting knowledge from a data warehouse. The results of the extraction uncover hidden patterns and inconsistencies that are not visible in the existing data sets. The discovery of such hidden patterns and data inconsistencies cannot be achieved using conventional data analysis and query tools. Data mining techniques differ from the conventional data analysis approach in that data mining extracts hidden patterns from a dataset, while conventional data analysis tools only confirm assumptions already made about the data set.

Many data mining scholars (Fayyad, 1998; Freitas, 2002; Han, J., 1996; Frawley, 1992) have defined and demonstrated data mining as a valuable tool for extracting useful knowledge from a data warehouse. According to Frawley et al. (1992), the term extracting is considered the process of mining hidden patterns and valuable information which are not evident in the existing data. Fayyad et al. (1996) defined data mining as one step of knowledge discovery, because knowledge discovery is a series of tasks which uses data mining to extract knowledge from a large pool of data. Han (2000) defined data mining as a process of extracting or mining data in a large database, also known as a data warehouse.

In this literature on data mining, several data mining techniques are used to demonstrate different approaches in different areas of application. Data mining techniques cover association, classification, clustering and prediction. Freitas (2002) stressed that for every data mining issue one has to consider all potential ways of solving it by trying different data mining techniques. On top of that, to carry out a successful data mining application with the chosen data mining techniques, a process model is required, as it includes a series of steps that guide the work to agreeable results. Chapter 2.4.1 discusses the data mining techniques available and which techniques will be used in this study. Chapter 2.4.2 presents the available data mining process models in detail and discusses the process model used throughout this research in deploying the experimental application tools, which are further discussed in Chapter 4.

2.4.1 Data Mining Techniques

In general, data mining is capable of predicting or forecasting future events based on a historical data set, and its purpose is to find hidden patterns in the database. Several data mining techniques are used and applied in different areas of data, and knowledge of how each data mining technique works is essential in order to select the suitable technique for a specific area.

According to Mailvaganam (2007), data mining techniques consist of two kinds of models, descriptive and predictive, as described in Diagram 2.16. Descriptive models can be generated by employing association rule discovery and clustering algorithms. Predictive models are generated by using classification and regression algorithms. Descriptive models can reveal hidden relationships in a given data set; for example, in a student database, students who pass mathematics tend to pass science. Predictive models can forecast future results in a given data set; for example, in marketing, a customer's gender, age, and purchase history might predict the likelihood of a future sale.

Diagram 2.16 Descriptive and Predictive Model (adapted from Mailvaganam, 2007)

A data mining algorithm is the mechanism that generates a data mining model. In order to generate a data mining model, a data mining algorithm needs to be defined. The algorithm then analyses a given set of data to look for identifiable hidden patterns and trends, and this result is used to define the parameters of the mining model. These parameters are then applied across the whole data set to extract actionable patterns and detailed statistics. More details on the data mining algorithms are discussed as follows:

  • Association algorithm is a powerful correlation counting engine. It can perform scalable and efficient analysis to identify items in a collection that occur together (a co-occurrence counting sketch is shown after this list).
  • Classification is the process of finding a set of models that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class labels are unknown. This is the decision tree algorithm, covering both classification and regression; it can also build multiple trees in a single model to perform association analysis.
  • Clustering algorithm includes two different clustering techniques: EM (expectation and maximization) and K-means. It automatically detects the number of natural clusters in the dataset and discovers groups or categories.
  • Prediction can be viewed as a model constructed and used to assess the class of an unlabeled sample, or the value range of an attribute that a given sample is likely to have.
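As a minimal sketch of the association idea referenced in the first bullet, co-occurring items can be found by counting item pairs across hypothetical transactions; this only illustrates the counting principle, not the Analysis Services engine:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions: items bought together in a single basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Count how often each pair of items occurs together across all transactions.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs whose support (co-occurrence frequency) meets a minimum threshold.
min_support = 2
for pair, count in pair_counts.items():
    if count >= min_support:
        print(pair, count / len(transactions))
```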

In data mining, choosing the best algorithm depends on the specific business user case. It is possible to use different data mining algorithms to mine the same business user case data sets; each algorithm will produce a different set of results, and some data mining algorithms can produce more than one type of result. Data mining algorithms are also flexible and do not have to be used separately. Within a single data mining solution, the first action may be to use one algorithm to explore the data set and then use another algorithm to predict a specific result based on the explored data. For example, an algorithm such as clustering can be used to explore the data, recognize patterns and break the data set into groups, and a decision tree model based on the classification algorithm can then be used to predict a specific outcome from that data.

Data mining models are used to predict values, find hidden trends and generate summarised data. It is important to know which data mining algorithm to use in order to address a given business user case. Table 2.11 shows suggestions on which algorithms to use for a data mining solution, adapted from the SQL Server Developer Centre (2009).

Table 2.11 Guide to selecting an algorithm for a data mining solution (adapted from Microsoft, 2009)

Business User Case                                  Algorithm
Predicting a discrete attribute                     Classification, Clustering, Neural Network
Predicting a continuous attribute                   Classification
Predicting a sequence                               Clustering
Finding groups of common items in transactions      Association, Classification
Finding groups of similar items                     Clustering

Once a data mining model has been built, deployed and trained, the details of its results are stored as data mining model nodes, which collect the attributes, description, probabilities and distribution information for the model element each node represents, together with its relation to other nodes. Every data mining model node has an associated node type that assists in describing the data mining model. The model node is the uppermost node, regardless of the actual structure of the model; all data mining models start with a model node.

In this study, the decision tree and clustering models are the two main data mining techniques that will be used. These data mining techniques and models are discussed further in Chapters 3 and 4.

2.4.1.1 Decision Tree Model

Decision trees are a robust and standard data mining model for classification and prediction. Decision trees are often preferred to neural networks because decision trees represent rules, and these rules can easily be expressed and understood. A decision tree model can be used to categorize an instance by starting at the root of the tree and following the branches until a leaf node is reached, which provides the classification of the instance.

A decision tree model is a tree-like structure using classification techniques, in which each node in the tree represents a question used to further classify the data. A decision tree is efficient and can be built faster than other models while acquiring results with similar accuracy in some cases; thus, it is appropriate for large training data sets. A decision tree model is easy to understand and interpret, depending on the complexity of the tree, and it can handle non-numerical data. The various methods used to create decision trees have been used widely for decades, and there is a large body of work describing these statistical techniques. According to Witten et al. (2000), the decision tree approach is known for its fast data mining modelling, as it uses a divide-and-conquer approach.

Witten et al. (2000) describe how a decision tree is constructed recursively. An attribute is placed at the root node of the tree, and one or more branches are created for its possible values. The training set is then split into subsets, one for each branch, and the process is repeated recursively for each branch until all instances at a node have the same classification, at which point tree construction stops. This means a leaf node with a single class of "true" or "false" cannot be split further, which ends the recursive process. The objective of the decision tree model is to build as simple a decision tree as possible that still produces good classification or predictive performance.
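A minimal sketch of this divide-and-conquer construction, assuming scikit-learn is available and using a small hypothetical pass/fail training set rather than the research datasets, is shown below; the printed output is the learned tree expressed as the kind of rules discussed above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training set: [mathematics score, science score] -> "true" (pass) / "false" (fail).
X = [[80, 75], [45, 50], [90, 85], [55, 40], [70, 65], [30, 35]]
y = ["true", "false", "true", "false", "true", "false"]

# Fit a small tree; each internal node is a question that splits the training subset.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The learned splits can be read as simple if-then rules.
print(export_text(tree, feature_names=["mathematics", "science"]))

# Classify a new, unlabeled instance by walking from the root to a leaf.
print(tree.predict([[60, 58]]))
```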

In a decision tree based model, as shown in Diagram 2.17, the model node serves as the root node of the tree. A decision tree model may have many tree nodes that make up the whole structure, but there is only one tree node from which all other nodes, such as interior and distribution nodes, descend for each tree. An interior node represents a split in the tree model and is also known as a "branch" node, while a distribution node describes the distribution of values for one or more attributes according to the data represented by that node and is also known as a "leaf" node. A decision tree based model always has one model node and at least one tree node.

Diagram 2.17 Decision Tree Model

2.4.1.2 Clustering Model

Clustering is a data mining technique that is used to separate a data set into groups, or clusters, based on the similarity between the data entities. Entities in a cluster share common features that differentiate them from other clusters, and similarity is measured in terms of the distance between elements or entities. Unlike classification, which has predefined labels (supervised learning), clustering is considered unsupervised learning because it automatically comes up with the labels. Clustering is a well-known data mining technique: the grouping of data based on specific criteria.

According to Kogan, J. et al. (2006), clustering techniques are divided into partitioning and hierarchical techniques. Partitioning techniques construct various partitions of similar and dissimilar items in groups or clusters evaluated by some criterion. Hierarchical techniques build a hierarchical decomposition of a set of data progressively, using either a top-down or a bottom-up approach. The top-down approach begins with a cluster containing all the data and breaks it down into smaller sub-clusters, while the bottom-up approach begins with small clusters and combines them recursively into larger clusters in a nested manner. The advantage of hierarchical clustering compared to partitioning is that it is flexible with regard to the level of granularity. Clustering techniques are assessed in terms of certain features related to the size, the distance between parts of the cluster, or the shape of the cluster, and they support applications that require segmenting the data into common groups.

In a clustering-based model, as shown in Diagram 2.18, the model node serves as the root node of the cluster structure. A cluster node gathers the attributes and data for the abstraction of a specific cluster; essentially, it gathers the set of distributions that comprises a cluster of cases for the data mining model. A clustering-based model always has one model node and at least one cluster node. Business users do not need to specify the number of clusters in advance: clustering automatically creates the natural number of clusters by specifying how similar the records within each individual cluster should be, and it works best with categorical and non-repetitive variables.

Diagram 2.18 Clustering Model
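A minimal sketch of the two clustering techniques mentioned above, K-means and EM, assuming scikit-learn is available and using hypothetical two-dimensional cases rather than the research datasets:

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical cases described by two numeric attributes.
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one natural group
     [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]   # another natural group

# K-means: a partitioning technique that assigns each case to the nearest cluster centre.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("K-means labels:", kmeans.labels_)

# EM (expectation-maximization): alternates between assigning cluster memberships (E step)
# and re-estimating the model parameters from those memberships (M step).
em = GaussianMixture(n_components=2, random_state=0).fit(X)
print("EM labels:", em.predict(X))
```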

2.4.2 Data Mining Process Model

In order to carry out successful data mining, a data mining process model is required. Such a process model involves a sequence of steps that lead to good results. Examples of these process models are CRISP (Chapman et al., 2000) and TWOCROWS (Two Crows, 1999). In this study, the experimental application tools are based on the CRISP data mining process model. The different phases of the CRISP process model are presented in Diagram 2.19; the focus of this chapter is on the first three CRISP phases appropriate to the research objectives.

Diagram 2.19 CRISP Process Model

According to Chapman et al. (2000), the CRISP data mining model is a life cycle for a data mining project which describes the phases of the project, including the tasks and the relationships between the tasks. The CRISP life cycle consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment; the arrows indicate the most important and frequent dependencies between phases.

The CRISP data mining process model begins with business understanding of the project objectives and requirements, as this is important in order to convert them into a data mining problem definition. The next step is data understanding of the datasets, to identify data quality problems and to discover interesting subsets that form hypotheses about hidden information. After the datasets have been identified, the data preparation phase loads the data into the modelling tools from the initial raw data; this phase is executed multiple times to complete the transformation and cleaning of the data for modelling. In the modelling phase, various techniques are used and applied to the data mining problem to obtain high quality models, and the evaluation phase is then executed to review the model(s) thoroughly and confirm that they achieve the business objectives. Finally, the deployment phase is executed to produce anything from simple reporting to a complex data mining process, and this phase is mainly triggered by the end-users.

In this study, the CRISP data mining process model is used because the process is repeatable in any case, is flexible, can be applied to different types of data and can be used in any business user's area. It also provides a uniform framework for guidelines and documentation.
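Purely as an illustration, the data preparation, modelling and evaluation phases described above can be mapped onto a short scikit-learn workflow; the data, split and model choice below are hypothetical and do not represent the CRISP reference procedure:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data preparation: hypothetical cleaned records and labels ready for modelling.
X = [[80, 75], [45, 50], [90, 85], [55, 40], [70, 65], [30, 35], [65, 70], [40, 45]]
y = ["pass", "fail", "pass", "fail", "pass", "fail", "pass", "fail"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Modelling: apply a chosen technique to the prepared data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluation: review the model against held-out data before deployment.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```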

2.4.3 OLAP Mining

According to Song Lin (2002), the combination of OLAP and data mining, known as OLAM or OLAP Mining, is a powerful tool for mining hidden patterns in a multidimensional database. OLAM provides suggestions to the decision-maker according to an internal model built with quantitative data mining methods such as regression or classification. Data mining is introduced into OLAP without requiring the development of new data mining algorithms: data mining is the process of applying intelligent methods to extract data patterns, provide automatic data analysis and prediction, gather hidden patterns and predict unknown information with data mining tools. Diagram 2.20 depicts the OLAP mining concept, where a multidimensional database integrates with data mining algorithms to produce effective reporting for decision makers.

Diagram 2.20 OLAP Mining Concept (adapted from Han, 1998)

According to Han, J. (1997), the Online Analytical Mining (OLAM) architecture provides a modular and systematic design for data mining systems and eases constraint-based, multidimensional mining on a data warehouse. In Diagram 2.21, the OLAM architecture consists of four layers. Layer 1 is the database layer, consisting of systematically constructed relational databases, which performs data cleaning, integration and consolidation in the building of data warehouses. Layer 2 is the multidimensional database layer, which offers a multidimensional database for online analytical processing (OLAP) and data mining. Layer 3 is the crucial layer for data mining, in which the OLAP and OLAM engines blend together for the processing and mining of data. Lastly, layer 4 holds the graphical user interfaces which allow users to build data warehouses and multidimensional databases, perform OLAP and mining, and visualize and explore the results. A proficient OLAM architecture should use existing infrastructure in this way rather than constructing everything from scratch.

Diagram 2.21 Online Analytical Mining (adapted from Han, 1998)

The OLAM architecture benefits OLAP-based systems by providing an exploratory data analysis environment using data mining technology. As depicted in Diagram 2.21, layers 1, 2 and 3 integrate the database, data warehouse, multidimensional database, OLAP and OLAM, making it possible to mine different subsets of data at different levels of abstraction by drilling, pivoting, slicing and dicing a multidimensional database and intermediate data mining results. It also eases online interactive data mining functions, such as decision trees and clustering, by allowing the results to be viewed with powerful and flexible knowledge visualization tools.
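To illustrate the slice-then-mine idea, the sketch below holds a tiny hypothetical cube as a pandas pivot table, slices one year and clusters the records in that slice; it is an analogy for OLAM behaviour, not the Analysis Services implementation:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical detailed records behind a small cube: region x year with two measures.
records = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North", "South"],
    "year":   [2008, 2009, 2008, 2009, 2009, 2009],
    "sales":  [120, 150, 80, 90, 160, 95],
    "visits": [30, 40, 20, 22, 42, 25],
})

# OLAP-style view: roll the records up into a region x year pivot of total sales.
cube = records.pivot_table(index="region", columns="year", values="sales", aggfunc="sum")
print(cube)

# OLAM-style step: slice one year from the detailed data and mine the slice,
# here by clustering the 2009 records on their measures.
slice_2009 = records[records["year"] == 2009][["sales", "visits"]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(slice_2009)
print("cluster labels for 2009 records:", labels)
```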

According to Han (1998), OLAM uses the integration of multiple data mining techniques, such as association, classification and clustering, to mine different portions of the data warehouse at different levels of abstraction. Data mining can be done with programs that analyse the data automatically, and in order to better understand customer behaviour and preferences, businesses use data mining to work through the huge amounts of information gathered. Vacca (2002) elaborated that OLAM is considered one of the different concepts and architectures for data mining systems: it combines OLAP with data mining and mines knowledge in multidimensional database cubes.

2.4.4 SUMMARY

In this study of data mining for descriptive and predictive models, classification using decision tree techniques and clustering techniques will be used. A precision evaluation data mining benchmark will be used to measure the accuracy of the results generated by the OLAP and data mining applications; the precision evaluation benchmark is considered more important here than a performance evaluation benchmark. The algorithms used in the precision evaluation are the decision tree and clustering techniques from the Microsoft data mining algorithms, as shown in Table 2.12. The experiments are based on different parameters of training datasets from school, e-commerce and medical datasets to demonstrate the accuracy of the results generated by the data mining applications using the OLAP and OLAM approach.

Table 2.12 The computation models of the Microsoft data mining algorithms

Classification
  Algorithm: Decision Tree
  Features: The decision tree is probably the most popular technique for predictive modelling.
  Computation model: Uses a Bayesian score as the default.

Clustering
  Algorithm: Expectation and Maximization (EM)
  Features: The algorithm iterates between two steps. In the first step, called the E or "expectation" step, the cluster membership of each case is calculated. In the second step, called the M or "maximization" step, the parameters of the model are re-estimated using these cluster memberships.
  Computation model: Similar to K-means.

2.5 LITERATURE REVIEW SUMMARY

Kramer (2002), author of "A Comparison of Data Warehousing Strategies and Product Offerings" from Green Hill Analysis, conducted a product comparison covering Microsoft with Hyperion Solutions, ORACLE Corporation and IBM Corporation on their data warehouse offerings. Kramer evaluates and compares the Data Warehouse, Build & Manage, OLAP, Data Mining and API strategies of Microsoft Corporation, IBM Corporation and Oracle Corporation. Table 3.3 summarizes Kramer's findings.

Table 3.3 Kramer Findings from Product Comparison

Study on: Data Warehouse
Description: Microsoft SQL Server, ORACLE i-series and IBM DB2 provide almost the same functionality for building a data warehouse for OLAP/data mining.

Study on: Build & Manage
Description: For Build & Manage, Microsoft SQL Server is the best choice because of its availability and also the background knowledge.

Study on: OLAP
Description: Microsoft beats ORACLE and IBM as the best OLAP implementer. Microsoft OLAP supports designing, building and managing OLAP in various storage models such as MOLAP, ROLAP and HOLAP.

Study on: Data Mining
Description: ORACLE and IBM come with separate packages and pricing for their data mining features, while Microsoft bundles data mining inside the Analysis Services component of SQL Server and supports data mining algorithms such as decision trees, clustering, association, forecasting and many more.

In this study, the Microsoft data warehouse, OLAP and data mining products will be used, as Microsoft provides a comprehensive data warehouse, OLAP and data mining platform. Its build and manage capabilities, OLAP and data mining capabilities, and application interfaces are its key strengths, and its data mining integration and data mining tools are also quite good. Throughout this literature review, Microsoft is chosen because these components are bundled "in the box" and it is also the best choice for cost savings among high-end database solutions.