Data Warehouse Architecture And Implementation Choices Information Technology Essay


Data warehousing is not simply a product, or a collection of products; it is a solution to the need for better decision making (Chuck Ballard, Dirk Herreman 1998). In this view, data are extracted from outside sources, transformed, and then loaded into a destination database (the data warehouse) repository. These technologies have been implemented successfully in many industries; in retail, for example, they are used for customer profiling and inventory management (G. Satyanarayana Reddy et al 2010).


A data warehouse is a consistent data store that provides the physical implementation of a decision-support data model and stores the information on which an organization bases its strategic decisions. Because it integrates data from multiple heterogeneous sources, its design should support ad hoc queries and analytical reporting. Data warehouses offer Online Analytical Processing (OLAP) tools for analysing multidimensional data. The performance and functional requirements of OLAP differ from those of the Online Transaction Processing (OLTP) applications that operational databases support.

The main objectives of this project are to improve retailer sales analysis and to build the relevant features and knowledge into a data warehouse by integrating Online Analytical Processing technology. Integrating OLAP allows multidimensional analytical queries to run efficiently against real-time transaction data. OLAP plays an important role in this project because it provides valid answers to business and management queries, supporting better decision making.

1.2 Road map

Chapter 2

Literature review

2.1 Overview

This chapter presents a framework for data warehouse technology drawn from a review of journals and research articles. The key topics are data warehouse architecture and implementation choices, Extract Transform Load (ETL), data warehouse design schemas, and multidimensional modelling. The knowledge gained is applied to the project to provide a sound solution for today's organizations. The data model is discussed and analysed through an investigation of the researched areas. The literature review also gives an impression of the uses of Online Analytical Processing systems in data warehousing.

This section also covers the existing technologies and tools used effectively in OLAP applications. This is not the first project in this area, so related research and prior work on similar topics are studied in order to understand the problem in detail and to identify modern, emerging technologies that can give future developers more effective structure and decision-support knowledge.

2.2 History of data warehouse

The origin of the data warehousing concept can be traced to the early 1980s, when relational database management systems appeared as commercial products. Bill Inmon (1993), the "father of data warehousing", defines a data warehouse as "a subject-oriented, integrated, non-volatile, time-variant collection of data organized to support management needs". Ralph Kimball (1996) defines it as "a copy of transaction data specifically structured for query and analysis". The simplicity of the relational model, together with the query capabilities offered by the SQL language, led to what was then described as decision support or end-user computing: data were extracted from the organisation's operational databases and stored in a freshly created database system to support reporting functions of all types and ad hoc end-user queries.

In the early days, the extracted operational data were snapshots or subsets. A decision-support (end-user computing) database system loaded these snapshots on a regular basis, once a week or once a month. The snapshots accumulated in the system as a limited number of editions, and end users were given access through reporting and query tools. Decision-support systems were not much concerned with data modelling: the data model of a decision-support system typically mirrored that of the operational system from which the snapshots were extracted (Chuck Ballard, 1998).

2.3 Data warehouse & Data mart

A data warehouse is intended for analysis and query rather than transaction processing. It is a relational database containing historical data copied from operational business data. It allows a business to merge data from several sources and to separate the analysis workload from the transaction workload. The process of collecting data and delivering it to business users is controlled by data warehouse frameworks such as ETL tools, OLAP servers and other applications. Data warehouse environments come in three different forms: the data mart, the enterprise data warehouse and the operational data store. A data warehouse has several dimension tables connected to a fact table, forming a dimensional data model.

The main drawback of a data warehouse is that it is a very expensive system to implement, because it focuses on the organization and its data model rather than on individual user needs. A data warehouse therefore needs considerable resources and time to implement, and maintenance is an ongoing cost as business conditions change (Review essays, 2011).

Figure 2.1 Data warehouse and data mart [diagram: a data warehouse feeding data mart 1, data mart 2, ..., data mart n]

A data mart is a subset of a data warehouse that focuses on a particular business area such as finance, sales, marketing or management. A data mart populated from the data warehouse is known as a dependent data mart; one populated directly from the sources is known as an independent data mart. A data mart serves a specific business function, such as evaluating sales performance, measuring the profitability of a newly launched product, assessing marketing promotions, or evaluating the performance of a new company branch (Firestone, 1997).

A data mart faces the same problems as any other system: data access, data volume, functionality, scalability and performance. Moreover, when data marts are fragmented across departments, each concentrating on its own needs, cleansing and data access become very difficult. If an organization has a separate data mart in each department (production, marketing, promotions, sales and inventory), combining the total profit information into a single data mart can be confusing and overwhelming (Review essays, 2011).

2.4 Data warehouse architecture and Implementation choices

There are three common architectural models for the implementation of data warehouses:

Top Down Architecture, Top Down Implementation,

Bottom Up Architecture, Bottom Up Implementation, and

Combined Approach.

In this section, we look at each of them in detail.

2.4.1 Top down architecture, Top down implementation

Figure 2.2 shows the top down architecture, in which a central data warehouse feeds many data marts. The data marts give users faster access, so they can analyse data and generate reports quickly; users can also access the data warehouse directly without any barrier (Chowdhury & Bikramjit, 2010). A top down implementation requires a great deal of planning and design work at the beginning of the project, and it involves people from every organization or business unit that will participate. The data standards, data structures, decisions concerning data sources and the overall data model need to be complete before the actual implementation begins. This approach typically produces an enterprise-wide or corporate-wide data warehouse with standardized access to the data for business organizations and workgroups. It gives better results from the beginning, such as more consistent data definitions and enforcement of business rules across the whole organization (Chuck Ballard, et al 1998).

Figure 2.2 Top down approach [diagram: a data warehouse feeding several data marts, which in turn serve user analysis and reporting]

2.4.2 Bottom up architecture, Bottom up implementation

Figure 2.3 shows the bottom up architecture, in which the union of all the data marts creates the data warehouse. This architecture is mainly used for analysis; specific data becomes available in the warehouse once it has passed through the staging process (Chowdhury & Bikramjit, 2010).

Figure 2.3 Bottom up approach [diagram: several data marts combining into a data warehouse, with user analysis and reporting drawing on both]

A bottom up implementation requires the planning and design of data marts without waiting for a global infrastructure to be in place. The initial data mart implementation can then be expanded and built upon incrementally. This approach is now more widely accepted than the top down approach, because data mart results are obtained immediately and can be used to justify expanding towards an additional, global implementation. Many organizations prefer this approach for its faster payback, especially in business management. Data marts have a less complex design than a global data warehouse, which enables faster results, and the initial implementation is typically far less costly in resources and hardware than organizing a large-scale data warehouse (Chuck Ballard, et al 1998).

2.4.3 Combined Approach

Both the top down and the bottom up approach have positive and negative aspects, and in many cases a combination of the two is the best approach. The balance is difficult to strike, but a good developer can achieve it. The key to this approach is determining the level of planning and design required to support global integration while the data marts are being built bottom up. A data warehouse implementation raises many issues; the combined approach enables them to be resolved as they are encountered, since a data mart has a smaller scope than a global data warehouse. If the implementation process is monitored carefully and the management issues are kept under control, the benefits of both techniques can be gained (Chuck Ballard, et al 1998).

2.5 Extract Transform Load (ETL)

According to industry estimates, approximately 60-70% of a data warehousing project's effort is spent on the ETL process alone; the rest is spent on analysis and reporting (Vishal et al 2010). ETL is a data integration function that involves extracting data from the original sources, quality assuring and cleaning the data, conforming labels and measures, and delivering the data in a physical format that report writers can use (Kimball 2009). Companies generally use extract, transform and load (ETL) technology to solve this problem: it reads data from an outside source, cleans it up, formats it uniformly, and finally loads it into a target repository (the data warehouse). The source of an ETL process can be almost anything: a flat file, an Excel spreadsheet, a mainframe application, an ERP application or a CRM tool (Vishal, et al 2010).

Occasionally, the data in the staging area is preserved to support functionality that requires history; at other times the data is deleted with each process. When the staging area maintains history, it is frequently referred to as a persistent staging area; if the data is deleted with each load, the area is considered transient (Kimball, 2009).

In the ETL process, the overall system coordinates the effort efficiently, determines what is required of each component, and synchronizes how the tools are used across all ETL events. Every organization should have ETL experts who can ensure that the ETL process is robust and resilient. The three processes can be combined into one tool that routinely pulls data out of one database and puts it into another. These functions, described as extract, transform and load, are shown in Figure 2.4 below. Extraction is the first step: the process reads data from the outside source database and extracts a desired subset of the data (Vishal et al 2010). Extracting data from different platforms is challenging, as each data source can be a different DBMS (Kimball 2009). The core steps of an ETL system are cleaning and conforming, in which accurate data is defined: data should be correct, unambiguous, consistent and complete (Kimball 2009).

The second step is transformation: the extracted, cleaned and conformed data is converted from its previous form into the form required by the target database. Transformation occurs by combining the data with other data, lookup tables or rules (Vishal et al 2010). In this process, transformation objects can be extended and combined to build new, reusable, context-independent objects (Wayne and Colin, 2003). The final step of the ETL process is writing the data into the target (destination) database (Vishal et al 2010). When implementing an ETL process a developer faces many problems: critical data may be missing from the source system; there are technical challenges in integrating and transforming data from different environments; query performance may be poor; business rules are difficult to maintain; and load times can be long (Vishal et al 2010).
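To make the three steps concrete, the sketch below implements a minimal ETL pass in Python; it is not drawn from the cited sources, and the file name, column names and target table are illustrative assumptions. It extracts rows from a flat file, transforms them by dropping records that are missing critical data and conforming labels and measures, and loads the result into a SQLite table standing in for the warehouse.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw rows from an outside source (here, a flat file).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: clean and conform each record before loading.
        cleaned = []
        for row in rows:
            if not row.get("product_id"):          # drop rows missing critical data
                continue
            cleaned.append((
                row["product_id"].strip(),
                row["store"].strip().upper(),      # conform labels to one format
                float(row["amount"] or 0.0),       # coerce measures to numbers
            ))
        return cleaned

    def load(records, db="warehouse.db"):
        # Load: write the conformed records into the target repository.
        con = sqlite3.connect(db)
        con.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                       (product_id TEXT, store TEXT, amount REAL)""")
        con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", records)
        con.commit()
        con.close()

    load(transform(extract("sales.csv")))

In a production pipeline each step would of course handle far more cases, but the division of labour between the three functions is the point being illustrated.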

Figure 2.4 Extract, Transform, Load (ETL) [diagram: sources (App 1, App 2, App 3, flat files, XML, unstructured data) flow through the ETL process into targets (flat files, XML, the data warehouse and data marts)]

Nowadays organizations need to feed more and more sources into several distributed Business Intelligence (BI) applications. To cope with this data flow, ETL needs to evolve from batch-oriented processing towards continuous extraction and loading, capturing data in real time. Finally, BI solutions gain a great deal by integrating data cleansing and profiling capabilities into the ETL process.

2.5.1 Data cleaning

In a data warehouse, data cleaning is required when heterogeneous data sources are integrated, together with the related data transformations. Data cleaning is a very important part of a data warehousing project and of the so-called ETL process; it is also known as scrubbing or cleansing. Cleansing improves data quality by identifying and removing errors. Files and databases exhibit many data quality problems: invalid data, duplicate values, misspelled data, missing values, and so on. When multiple data sources are integrated, for example in federated database systems, the need for cleansing grows significantly because the sources contain redundant data.

Cleansing provides extensive support for the data warehouse by improving the quality of its data. Huge amounts of data are loaded and refreshed continuously from different sources, and those sources contain some "dirty data". Missing information and duplicate values produce incorrect information, and since data warehouses are used for decision making, incorrect data leads to wrong conclusions for business and management (Erhard, 2000).
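The cleansing steps described above (removing duplicates, handling missing values, and rejecting invalid or misspelled data) can be sketched with the pandas library as follows. This is a toy illustration; the table and its columns are assumptions, not taken from the cited work.

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 1, 2, 3, None],
        "email": ["a@x.com", "a@x.com", "b@x", "c@x.com", "d@x.com"],
        "age": [25, 25, -3, 41, 30],
    })

    df = df.drop_duplicates()                      # remove duplicate records
    df = df.dropna(subset=["customer_id"])         # drop rows missing the key
    df = df[df["age"].between(0, 120)]             # reject invalid ages
    df = df[df["email"].str.contains(r"@.+\..+")]  # reject malformed emails

    print(df)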

2.6 Data Warehouse Schema

The next part of this chapter offers a critical analysis of three data warehouse designs: the entity relationship schema, the star schema and the snowflake schema, each explained in detail in this section. Owing to space restrictions, other data warehouse designs cannot be evaluated in depth. Research on data warehouse design is still relatively scarce, but some important works have been published over the past few years.

2.6.1 Entity Relationship Schema

An entity relationship schema generates a data model for a specific area of interest using two fundamental concepts: entities and the relationships between the individual entities. ER models also contain attributes, which can belong either to entities or to relationships. The entity relationship model is an abstraction tool, because it can be used to simplify and understand ambiguous data relationships in complex system environments and in the business world.

A key task in ER modelling, and a critical one, is defining the unique identifier of each entity. A candidate key is an attribute (or set of attributes) that uniquely identifies an entity; the candidate key selected to identify the entity is known as the primary key. In an ER diagram, a relationship is denoted by a line connecting two entities (Chuck, et al 1998).
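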

Figure 2.5 Entity Relationship Schemas [diagram: four entities with attributes, connected by relationships; each entity has a primary key, and foreign keys link the related entities]

Figure 2.5 shows an entity relationship schema; it can be used to symbolize the entities within an organization and represents the primary key and foreign key relationships between entities. An ER schema has no notion of dimension tables and fact tables, which are the distinguishing features of multidimensional modelling.

The motivation for ER modelling is to protect data integrity: each data element is stored only once, which makes maintenance easy and provides flexibility, an effect achieved by eliminating data redundancy. ER models are well suited to online transaction processing, where they keep transactions as simple as possible (Yashvardhan, 2011). Their major problems are that end users cannot easily remember or understand them, ad hoc queries are not optimized, browsing is difficult, and many data warehouses have failed because of overly complex ER designs.
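As a small sketch of these ideas (illustrative only; the two entities and their columns are assumptions), the following Python snippet defines two entities in SQLite, chooses a primary key from each entity's candidate keys, and uses a foreign key to denote the relationship drawn as a line in the ER diagram:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Entity 1: customer. customer_id is the primary key chosen from
    -- the candidate keys (customer_id and email are both unique).
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT UNIQUE,
        name        TEXT
    );

    -- Entity 2: orders. The foreign key plays the role of the line
    -- connecting the two entities in the ER diagram.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        order_date  TEXT
    );
    """)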

2.6.2 Star Schema

In data warehousing, the most popular design for representing multidimensional data is the star schema; it is the basic structure built around a fact table and its dimension tables (Chuck, et al 1998). Bringing the fact table and dimension tables together in a star-like structure is usually called a star join schema (Kimball and Margy, 2002). As Figure 2.6 shows, a star schema has a single fact table at the centre, connected to the group of dimension tables surrounding it. Each dimension table has one primary key, which appears in the fact table as a foreign key.

Figure 2.6 Star Schema [diagram: a central fact table holding foreign keys FK 1 to FK 4, each referencing the primary key of one of four surrounding dimension tables]

The benefits of the star schema are its simple design, its comprehensibility, and the ease with which hierarchies can be defined. The star schema is a fully denormalized structure, in contrast to the snowflake schema: in a snowflake schema a dimension table may have one or more parent tables, whereas in a star schema there are none. In a snowflake schema each hierarchical attribute may be split into a separate table, which introduces additional joins and degrades performance; this does not arise in the star model, because the hierarchies are stored within the dimension tables themselves. The main drawbacks are very large dimension tables and poor fact table performance for summary data (Enrico, 2006).
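A minimal star schema sketch in SQLite follows (table and column names are illustrative assumptions, not from the cited sources): one central fact table carries the measures and one foreign key per dimension, and a typical star join aggregates the measures grouped by dimension attributes.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY,
                              city TEXT, region TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                              day TEXT, month TEXT, year INTEGER);

    -- The fact table holds one foreign key per dimension plus the measures.
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        units_sold  INTEGER,
        revenue     REAL
    );
    """)

    # A typical star join: aggregate the measures, grouped by dimension attributes.
    rows = con.execute("""
        SELECT d.year, p.category, SUM(f.revenue)
        FROM sales_fact f
        JOIN dim_product p ON f.product_key = p.product_key
        JOIN dim_date d    ON f.date_key = d.date_key
        GROUP BY d.year, p.category
    """).fetchall()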

2.6.3 Snowflake Schema

The snowflake schema is another extremely popular data warehouse design; it can be thought of as a compromise between a star schema and an ER schema. Generally, the snowflake schema is an extension of the star schema: it is a variation in which the dimension tables are structured into hierarchies by normalizing them. Figure 2.7 shows a snowflake schema with the star schema at its centre. In a snowflake schema, some parts of the data warehouse are normalized up to third normal form, and the dimension hierarchies are split into one or more sub-dimension tables (Umashanker, et al 2006).

These sub-dimension tables increase the number of normalized tables and support different levels of hierarchy within the snowflake schema. The chain of 1:M relationships from the sub-dimension tables towards the fact table corresponds to the dimensional hierarchy.

Figure 2.7 Snowflake Schema [diagram: a central fact table surrounded by dimensions 1 to 4, each of which is further normalized into sub-dimensions 5 to 9]

The main advantages of the snowflake schema are that it is easy to maintain and increases flexibility. In several cases a snowflake schema can even improve performance, when only the smaller tables need to be joined. The major drawbacks are that end users must work with a larger number of tables and that queries become harder to write, because more tables need to be joined (Crag, 2007). When selecting a modelling technique, most database designers do not consider the space saving to be an important decision criterion (Chuck, et al 1998).
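Continuing the illustrative star sketch above, snowflaking the product dimension might look like the following: the category level moves into its own sub-dimension table, normalizing the hierarchy at the cost of one extra join (all names remain assumptions).

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Sub-dimension: the category level of the product hierarchy.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY,
                               category_name TEXT);

    -- Snowflaked dimension: references its parent level instead of
    -- storing the category name redundantly in every product row.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        name         TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );

    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES dim_product(product_key),
        revenue     REAL
    );
    """)

    # The extra join is the price paid for the normalized hierarchy.
    con.execute("""
        SELECT c.category_name, SUM(f.revenue)
        FROM sales_fact f
        JOIN dim_product p  ON f.product_key = p.product_key
        JOIN dim_category c ON p.category_key = c.category_key
        GROUP BY c.category_name
    """)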


2.7 Multidimensional Modeling

Kimball established multidimensional database technology in 1998. The technology has recently reached the mass market, and the major vendors now distribute multidimensional engines along with their relational databases. It has significant advantages in maturity and scalability. The main purpose of the multidimensional data model is analysis, rather than performing online transactions. The technology focuses on three key concepts, dimensions, cubes and measures, and on modelling the business rules. The multidimensional model is a central part of the decision-making process for analysing huge amounts of data. A multidimensional database model is built on relational elements: the fact and dimension relations are developed from the dimensions (Umashanker, et al 2006).

An Online Analytical Processing (OLAP) database requires data summarized at several levels of detail, over different combinations of attributes. In a data warehouse, a data cube is created from a subset of attributes: particular attributes are selected as measure values, while others are chosen as dimensions (Antoaneta, et al 2004); this is what is called multidimensional modelling. In a business-focused environment, the multidimensional model translates the representation into terms closer to the business. The key concepts of cubes, measures and dimensions relate to each other to offer a reporting environment in which users can obtain additional calculations rapidly and without difficulty. Multidimensional database systems, typically used for decision support and business analysis, fall into two divisions: the first builds multidimensional schemas such as the star and snowflake schema on relational database systems; the second is designed specifically for online analytical use only (Umashanker, et al 2006).
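As a hedged illustration of choosing measures and dimensions (the column names are assumed, not from the cited sources), a small data cube can be built in pandas by aggregating one attribute as the measure over the others as dimensions:

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2023, 2023, 2024, 2024],
        "region":  ["North", "South", "North", "South"],
        "product": ["tea", "tea", "coffee", "coffee"],
        "revenue": [100.0, 150.0, 120.0, 90.0],   # the measure
    })

    # Dimensions on the axes, the measure aggregated in the cells:
    cube = sales.pivot_table(index=["year", "region"],
                             columns="product",
                             values="revenue",
                             aggfunc="sum")
    print(cube)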

2.7.1 Goals of Multidimensional Model

The goal is to offer end users information that corresponds to their usual perception of the business: facts and figures seen from the different points of view that matter to them. The model also supports the physical implementation, so that the software can access the data easily. The basic modelling principles of a multidimensional database are that the fact table is analysed first and contains the key columns and the facts; the dimension tables link to those key columns through their primary keys, while the fact columns represent the actual measured values (Umashanker, et al 2006).

2.7.2 Application Areas

A data warehouse is a huge repository that integrates data from various sources for analysis in a project. Online analytical processing (OLAP) gives the best solution for queries that combine huge amounts of data to find the overall trends and the efficiency of the system. The fact table is therefore built at the smallest level of detail. Dimension tables reduce the size of the fact table and, in several cases, the amount of duplicated data. Where the relationship between a dimension and the fact table is many-to-many, a snowflake form is required via normalized dimensions: the snowflake method minimizes the dimension table volumes, with one dimension table connected to other dimension tables, which in turn link to the main fact table. Within the fact table, the tracked data should be unique and should not be summarized, and it should be kept at a constant grain of the required size. When designing, it should be guaranteed that all the tracked information present in the fact table also appears in the dimension tables.

In addition, the multidimensional model has three significant application areas within data analysis. Before planning a multidimensional database system, first select the modelling process, which includes the functional requirements. Then verify the numeric values to identify the measures. The dimensions are analysed and assembled into perspectives against which the data and values are verified. Finally, the database is defined in a logical model, which is determined from the object analysis and the perspectives (Umashanker, et al 2006).

2.7.3 Logical Multidimensional Model

Multidimensional database technology is very important in business because it enforces simplicity. The multidimensional model is organized into logical cubes, dimensions, measures, levels, hierarchies and attributes. The model is simple because its objects are defined to correspond to real-world business entities: analysts know which dimensions and attributes give meaningful data, which business measures matter most, and how the dimensions of their business are organized into levels and hierarchies. The essential logical models for online analytical processing (OLAP) applications are multidimensional data cubes, and OLAP tools provide analysis over the original information. To achieve this objective, the tools use a multidimensional database model for data presentation and storage (Umashanker, et al 2006).

Figure 2.8 Logical Multidimensional Models [diagram: a cube comprising measures and dimensions; each dimension has dimension attributes, levels and hierarchies]

As Figure 2.8 shows, cubes in the logical model are not self-existing entities but views over the original data set. Measures that have the same shape, that is, exactly the same dimensions, are organized together into logical cubes. Measures in the same cube have identical relationships to the other objects and can easily be displayed and analysed together. The model allows users to interact with each dimension and decide which values to display in the target presentation area; dimensional queries are established in a number of steps. In a purely relational model, users must manipulate all of the underlying components themselves, which leads to unexpected result sets and misunderstanding (Umashanker, et al 2006).

2.8 Online Analytical Processing (OLAP)

Online Analytical Processing (OLAP) allows business users to project and select (that is, to slice and dice) their data. The most important OLAP vendors are Cognos, Business Objects and Hyperion. Building an OLAP environment to perform this analysis involves large investments of financial resources and time (Mailvaganam, 2007). OLAP uses a multidimensional data format (data cubes) to simplify queries and improve response time. OLAP technology manages and summarizes huge amounts of data through data aggregation and data modelling techniques, so results can be calculated rapidly using graphical tools and online analysis. An OLAP system supports real-time analysis and offers flexibility.

In an organization, data is generally spread across several data sources that are incompatible with each other. A retail example: sales and point-of-sale data are gathered in different places and in different arrangements, through the web or a call centre. Producing OLAP reports for the marketing team, such as which products customers between the ages of 18 and 25 prefer, would otherwise be a time-consuming process. Extracting the data from the different data stores and making it compatible is therefore part of an OLAP implementation. As an example of incompatible data: for purchases made over the web, age might be stored as a birth date, while elsewhere it is held in age categories such as 18 to 25. In operational systems, data such as point-of-sale records is stored in databases of the Online Transaction Processing (OLTP) type. The major differences between OLTP and OLAP system design are summarized in Table 1.1 below.

Table 1.1 Differences between OLTP and OLAP

| Item | OLTP System: Online Transaction Processing (Operational System) | OLAP System: Online Analytical Processing (Data Warehouse) |
|---|---|---|
| Data source | Operational data: the original source of the data | Consolidated data: data comes from the several OLTP databases |
| Purpose of data | To organize and manage fundamental tasks | To help with decision support, planning and problem solving |
| User | IT professionals | Knowledge workers |
| Database design | Application oriented | Subject oriented |
| Database size | 100 MB to GB | 100 GB to TB |
| Processing speed | Very fast | Depends on the amount of data involved |
| Queries | Simple, relatively standard queries | Complex queries |
| Access | Read and write | Read only |
| Backup and recovery | Operational data is critical to running the business; data loss means major financial loss and legal liability, so regular backups are needed | Instead of regular backups, many sites simply reload the data as their recovery method |

Typical OLAP operations include roll up (drill up), which increases the aggregation level; drill down (roll down), which decreases the aggregation level along one or more dimension hierarchies; slice and dice (select and project); and pivot, which produces cross charts (Chaudhuri and Dayal, 1997). The key features of OLAP applications across functional areas are the multidimensional view of data, calculation-intensive capabilities and time intelligence (Aparajita, 2004).
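These four operations can be sketched in pandas on a toy data set (names assumed; this illustrates the operations themselves, not any vendor's OLAP engine):

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2023, 2023, 2024, 2024],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "region":  ["North", "South", "North", "South"],
        "revenue": [100.0, 150.0, 120.0, 90.0],
    })

    # Roll up: raise the aggregation level from quarter to year.
    rollup = sales.groupby("year")["revenue"].sum()

    # Drill down: lower it again along the time hierarchy (year -> quarter).
    drilldown = sales.groupby(["year", "quarter"])["revenue"].sum()

    # Slice: select a single value of one dimension.
    slice_2024 = sales[sales["year"] == 2024]

    # Dice: select a sub-cube over several dimensions at once.
    dice = sales[(sales["year"] == 2024) & (sales["region"] == "North")]

    # Pivot: rotate the axes to produce a cross chart.
    pivot = sales.pivot_table(index="region", columns="year",
                              values="revenue", aggfunc="sum")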

Figure 2.9 Steps for OLAP Creation Process [diagram: scheduled automated processes extract data from OLTPs, transform and standardize the data, import it into the OLAP database, build cubes and produce reports]

When designing the solution, the OLAP storage mode is one of the significant choices to be made. There are three types, and we look at each of them in detail:

Multidimensional OLAP

Relational OLAP

Hybrid OLAP

2.8.1 Multidimensional OLAP

In MOLAP, the source data is stored in multidimensional cubes rather than in a relational database, and the aggregations are stored in the same format. Data retrieval in MOLAP is very fast, although it needs more disk space; disk space is less of a worry these days, with storage and processing costs so low (Mailvaganam, 2007). The main advantage is that once the cube is generated, data can be queried easily, because all of the calculations are pre-generated. The major drawback of this model is that it can handle only a limited amount of data.
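A toy sketch of those pre-generated calculations follows (this is an assumption-laden illustration, not any vendor's storage format): a MOLAP-style store precomputes every aggregate combination once, so a query becomes a constant-time lookup instead of a scan.

    from itertools import product

    # Base cells of a tiny cube: (year, region) -> revenue.
    cells = {(2023, "North"): 100.0, (2023, "South"): 150.0,
             (2024, "North"): 120.0, (2024, "South"): 90.0}

    # Pre-aggregate every combination, using "ALL" for the rolled-up level.
    years = {y for y, _ in cells} | {"ALL"}
    regions = {r for _, r in cells} | {"ALL"}
    cube = {}
    for y, r in product(years, regions):
        cube[(y, r)] = sum(v for (cy, cr), v in cells.items()
                           if y in (cy, "ALL") and r in (cr, "ALL"))

    # Querying is now a lookup, not a scan.
    print(cube[("ALL", "North")])   # 220.0
    print(cube[(2024, "ALL")])      # 210.0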

2.8.2 Relational OLAP

In ROLAP, all the source data and aggregations are stored in the relational database. For large data warehouse implementations, this method is considered where higher usage is required. In ROLAP, data retrieval is slower, whether or not an aggregation is present. The relational database should
