Architecture For Data Warehouse Computer Science Essay


What is business intelligence? Business intelligence is the set of tools and applications used for the analysis and interpretation of data; it is also described as analytical software. Business intelligence helps organizations make business decisions and recommendations with the use of a data warehouse, the back-end system usually responsible for enabling business intelligence. Business intelligence also includes insightful analysis of experience through data mining and the analysis of unstructured data.

What is data warehousing? A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. First, subject-oriented means a data warehouse is used to analyze and give information about a specific area; for example, "products" could be a specific area to be analyzed. Integrated means a data warehouse takes data from outside sources and reconciles it: source X and source Z may have many ways of identifying a product, but in the data warehouse there is only a single way to identify it. Time-variant means data is stored according to particular time periods; such data is called historical data. For example, we can retrieve data from 3 or 6 months ago, or even older, from a data warehouse, whereas a transaction system usually stores only recent data. For instance, a transaction system may hold only the most recent address of a customer, but a data warehouse can hold all the addresses related to that customer. Last but not least, non-volatile means that once data is stored in the data warehouse it remains stable; historical data in a data warehouse should never be altered. In short, a data warehouse is used to store data and to create information, with the help of data marts.
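The time-variant and non-volatile properties can be illustrated with a small sketch (in Python, with hypothetical data): a transaction system overwrites the customer's address, while the warehouse appends a new row with a validity period and never alters history.

```python
# Transaction system: one row per customer, overwritten on change.
current_address = {"C001": "12 Oak Street"}

# Data warehouse: history is appended, never altered (non-volatile),
# and each row is tagged with its validity period (time-variant).
address_history = [
    {"customer": "C001", "address": "7 Elm Road",
     "valid_from": "2009-01-01", "valid_to": "2010-06-30"},
    {"customer": "C001", "address": "12 Oak Street",
     "valid_from": "2010-07-01", "valid_to": None},
]

def address_on(history, customer, date):
    """Return the address a customer had on a given ISO date."""
    for row in history:
        if (row["customer"] == customer
                and row["valid_from"] <= date
                and (row["valid_to"] is None or date <= row["valid_to"])):
            return row["address"]
    return None
```

The transaction system can answer only "where does the customer live now", while the warehouse can also answer "where did the customer live in 2009".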

Data warehouse architecture is built depending on the environment and situation of the system, so there is no right or wrong architecture. There are many interconnected layers in a data warehouse; the simplest possible conceptualization of data warehouse architecture consists of four layers: the operational database layer, the data access layer, the metadata layer and the information access layer. Below is a diagram of this simple data warehouse architecture.

Information Access Layer

Metadata Layer

Data Access Layer

Operational Database Layer

Operational Database Layer

This layer holds the source data coming from the operational systems and other outside sources.

Data Access Layer

This layer lets the information access layer communicate with the operational database layer; it is responsible for interfacing between the information access tools and the operational databases.

Metadata Layer

Metadata is data about the data within the enterprise. It forms the data directory for the entire data warehouse and can be accessed by reporting and analysis tools. End users can view and retrieve data from the data warehouse without needing to know where the data actually resides.

Information Access Layer

The layer that end users deal with directly, using tools such as Excel, Lotus 1-2-3 and Access. It also includes the hardware and software involved in displaying and printing reports.


ETL stands for extraction, transformation and loading. It is the core process of data integration and is typically associated with data warehousing; the term is also used to describe data migration or data conversion processes. As the amount and complexity of data has grown dramatically over the past decade, ETL processes have become more complex and demanding.



Reliable Source

Make sure the data is extracted from the correct and reliable master source. For example, the customer ID and address may be available in three different systems, but they should be extracted from the source where they are most complete.

Data Consistency

Extraction programs pick up data from tens of different sources. Make sure that the data across these systems represents a single, consistent view of the business, not different views of the business at different times.

Time Availability

Make sure the data is available for further processing at the right time and in the right form. For example, if the transformation process starts at 5 a.m., the extraction process should be able to extract data from all the source systems before that, synchronized with their end-of-day processing.

Completeness of extraction

Make sure the data extraction is well audited. The process should run quality checks to confirm that all the data has been extracted from all sources before proceeding to the transformation process. These checks are typically done on the basis of the last business transaction date.

Quality of extraction

The process should ensure that the extracted data in the staging area is completely synchronized with what was in the production databases. This can be done in many ways, such as running count and aggregation matches on critical fields.
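A minimal sketch of such a count-and-aggregation audit, assuming in-memory rows and a hypothetical amount field:

```python
def reconcile(source_rows, staged_rows, amount_field="amount"):
    """Compare row counts and a control total between source and staging.

    Returns (ok, detail): ok is True only when both the counts and the
    aggregated totals on the critical field match exactly.
    """
    src_count, stg_count = len(source_rows), len(staged_rows)
    src_total = sum(r[amount_field] for r in source_rows)
    stg_total = sum(r[amount_field] for r in staged_rows)
    ok = src_count == stg_count and src_total == stg_total
    return ok, {"source": (src_count, src_total),
                "staging": (stg_count, stg_total)}

source = [{"amount": 100}, {"amount": 250}]
staging_ok = [{"amount": 100}, {"amount": 250}]
staging_bad = [{"amount": 100}]          # a row was lost during extraction

ok1, _ = reconcile(source, staging_ok)
ok2, _ = reconcile(source, staging_bad)
```

A real audit would run the equivalent COUNT and SUM queries against the production database and the staging area; the idea is the same.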


Extraction is the first part of an ETL process. In this step, we extract data from outside source systems, each with a different data organization; the common formats are relational databases and flat files. When choosing data sources there are two types: non-cooperative sources and cooperative sources. The extract strategy depends heavily on the source type.

Non-cooperative Sources

- Snapshot sources: provide only a full copy of the source.
- Specific sources: each is different, e.g., legacy systems.
- Logged sources: write a change log (DB log).
- Queryable sources: provide a query interface, e.g., SQL.

Cooperative Sources

- Replicated sources: publish/subscribe mechanism.
- Call-back sources: call external code (ETL) when changes occur.
- Internal action sources: take only internal actions when changes occur, e.g., DB triggers.

The goal of extraction is a fast extract of the relevant data, because extraction from outside sources can take a long time to finish. There are two ways to extract the data: extract applications (SQL) and DB unload tools. If the priority is extraction speed, DB unload tools are the better choice; if the priority is co-existence with other applications, extract applications (SQL) are. The longer an extraction takes, the more heavily loaded the operational systems and the data warehouse systems become. Extraction should therefore pull only the changes since the last load of the data source. A number of methods can be used to extract only the changes in a data source.




- Store sorted total extracts in the Data Staging Area (DSA)
  - Always possible to do
  - Handles deletions
  - Does not reduce extract time
- Put an update timestamp on all rows
  - Reduces extract time
  - Less operational overhead
  - Cannot handle deletions
  - Source system must be changed
- DB triggers
  - Operational application need not be changed
  - Enables real-time update of the data warehouse
  - Adds operational overhead
- Replication based on the DB log
  - Operational application need not be changed
  - No operational overhead
  - Not possible in some DBMSs (SQL Server, DB2 and Oracle can do it)
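The update-timestamp strategy above can be sketched as follows (hypothetical rows and field names); note that, as listed, it reduces extract time but cannot detect rows deleted from the source.

```python
# Source table where every row carries an update timestamp.
rows = [
    {"id": 1, "product": "P-100", "updated_at": "2011-03-01"},
    {"id": 2, "product": "P-200", "updated_at": "2011-03-05"},
    {"id": 3, "product": "P-300", "updated_at": "2011-03-09"},
]

def extract_changes(table, last_load_time):
    """Extract only rows changed since the previous load (ISO timestamps)."""
    return [r for r in table if r["updated_at"] > last_load_time]

# Only rows touched after the last load on 2011-03-04 are extracted.
delta = extract_changes(rows, "2011-03-04")
```

Against a real database this would be a `WHERE updated_at > :last_load_time` query; the list comprehension plays the same role here.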

In real scenarios the data source can take many forms, such as Excel, Access, delimited text and CSV (comma-separated values) files. The extraction process handles the complexity of understanding the data source and loading it into the structure of the data warehouse.


Transformation is the second part of the ETL process; it can also be called the clean-up process. At this stage a series of rules or functions is applied to the extracted records to derive the data to be loaded into the end target database. The data is not necessarily clean and valid straight after extraction. For instance, some financial figures may have NULL values, but you want them to be zero for better analysis; you could have a stored procedure that runs through all extracted records and sets such values to zero. Depending on the situation, one or more transformation types may be needed to meet the business and technical needs.
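The NULL-to-zero rule could be sketched like this, with a hypothetical record layout and a plain Python pass standing in for the stored procedure:

```python
extracted = [
    {"invoice": "INV-1", "total": 120.0},
    {"invoice": "INV-2", "total": None},   # missing figure from the source
    {"invoice": "INV-3", "total": 80.0},
]

def null_to_zero(records, field):
    """Return copies of the records with NULL (None) in `field` set to 0."""
    return [dict(r, **{field: r[field] if r[field] is not None else 0})
            for r in records]

cleaned = null_to_zero(extracted, "total")

# With the NULLs replaced, aggregation no longer fails or skews.
grand_total = sum(r["total"] for r in cleaned)
```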

Transformation Types:

Select only certain columns to load. For instance, you may select only columns without null or blank values.

Encoding. Converting to a common coding system. Gender may be coded, for instance, in a variety of ways (e.g., m/f, male/female, or M/F) in different systems. The extraction program must transform data from each application to a single coding system (e.g., m/f).

Unit of measure. Distance, volume and weight can be recorded in varying units in different systems (e.g., centimeters and inches) and must be converted to a common system.

Field. The same attribute may have different names in different applications (e.g., sales-date, sdate, or saledate), and a standard name must be defined.

Date. Dates are stored in a variety of ways. In Europe, the standard for date is dd/mm/yy, and in the U.S. it is mm/dd/yy, whereas the ISO standard is yyyy-mm-dd.

Filtering and sorting the extracted records.

Merge data from multiple sources.

Do simple or complex data validation on the extracted records. Records that fail validation may be prevented from proceeding to the next process.

Creating common keys. Different systems have different keys representing the same entity. For example, the core production system could have an agent code different from the sales system's; the same applies to vendor, customer and other codes. Ideally, this alignment should first be done in the production systems themselves.

Creating surrogate keys. Surrogate keys are linked to the production keys. In the dimensional model, one should use the surrogate keys in the loaded data and star schemas, while keeping the production key as one of the attributes in the dimension table.

Data type conversion. There are many fields in the production data whose types do not suit the processing logic. For example, you may find credit card numbers stored in character form; these will need data type conversion before the next process.

De-normalization. Converting to a data warehouse model usually involves both normalizing and de-normalizing. De-normalization is the more common of the two, because a dimensional model is inherently heavily skewed towards being de-normalized. For example, you may have city, zone, region and country in separate tables in a production data model; those fields will be de-normalized into one dimension in the dimensional model.

Normalization is used less often in this conversion. The reason is that in the real world production databases are often not normalized to expectations, and highly normalized production tables lead to response-time issues due to too many joins.
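Several of the transformation types above (encoding, unit of measure and date format) can be combined into one illustrative pass. The coding table, field names and conversion targets below are assumptions for the sketch, not part of the original text:

```python
# Map every source gender code onto a single coding system ("m"/"f").
GENDER_CODES = {"m": "m", "male": "m", "M": "m",
                "f": "f", "female": "f", "F": "f"}

def transform(record):
    """Normalize encoding, unit of measure and date format for one record."""
    out = dict(record)
    out["gender"] = GENDER_CODES[record["gender"]]           # single coding system
    out["height_cm"] = round(record["height_in"] * 2.54, 1)  # inches -> centimetres
    mm, dd, yy = record["sale_date"].split("/")              # US mm/dd/yy -> ISO
    out["sale_date"] = "20{}-{}-{}".format(yy, mm, dd)
    del out["height_in"]
    return out

row = transform({"gender": "F", "height_in": 65, "sale_date": "03/15/11"})
```

The "20" prefix assumes two-digit years all fall in the 2000s, which is exactly the kind of business rule a real transformation layer has to make explicit.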

During transformation, the process must make sure the data meets data warehouse quality standards. The first criterion is precision: data warehouse data must match known numbers, or an explanation is needed. The second is completeness: the data warehouse has all relevant data and the users know it. The third is consistency: there must be no contradictory data in the data warehouse. The fourth is uniqueness: the same data must not be duplicated in another place. The last is timeliness: make sure the data is updated frequently enough and that the users know when this happens.

Unfortunately, some of the data collected from applications may be dirty: it may contain errors, inconsistencies or redundancies. There are a variety of reasons why data may need cleaning:

The same record is stored by several departments. For instance, both Human Resources and Production have an employee record. Duplicate records must be deleted.

Multiple records for a company exist because the same data was collected more than once. For example, the record for Time Warner should be removed because the company was acquired by AOL.

Multiple entries for the same entity exist because there are no corporate data entry standards. For example, FedEX and Federal Express both appear in different records for the same company.

Data entry fields are misused. For example, an address line field is used to record a second phone number.

Data cleaning starts with determining the dirtiness of the data. An analysis of a sample should indicate the extent of the problem and whether commercial data-cleaning tools are required. Data cleaning is not a one-time process. All data added to the data warehouse should be validated in order to maintain the integrity of the warehouse. There are several types of cleansing:

Conversion and normalization

- Convert text to a uniform format according to the coding system, including date formats.
- This is the most common type of cleansing.

Special-purpose cleansing

- Normalize the spellings of names, addresses, sale dates, etc. to the standard formats used in the data warehouse.
- Remove duplicate records, such as duplicate customers.

Rule-based cleansing

- Clean records with user-specified rules.
- Automatic rules: use data mining to find patterns in the data, e.g., guess a missing salesperson based on customer and item.
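The special-purpose and rule-based cleansing above might look like this in miniature; the name-rule table and record layout are hypothetical:

```python
# User-specified rules mapping name variants onto one canonical spelling.
NAME_RULES = {"fedex": "FedEx", "federal express": "FedEx"}

def cleanse(records):
    """Normalize company names, then drop duplicate customers."""
    seen, out = set(), []
    for r in records:
        name = NAME_RULES.get(r["company"].lower(), r["company"])
        key = (name, r["city"])
        if key not in seen:            # duplicate customer -> skip
            seen.add(key)
            out.append({"company": name, "city": r["city"]})
    return out

cleaned = cleanse([
    {"company": "FedEX", "city": "Memphis"},
    {"company": "Federal Express", "city": "Memphis"},
    {"company": "Acme", "city": "Boston"},
])
```

Note that the duplicate only becomes detectable after normalization: "FedEX" and "Federal Express" collide only once both map to "FedEx".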


Loading is the final process of ETL. Data loading starts once the data from the transformation process is ready. The process itself is straightforward, since it just moves the transformed data into the end target database, but there are several methods that can speed it up:

Turn off logging while loading the data

Once logging is off, a lot of overhead is saved, because the database does not have to write a transaction log entry for every action it performs.

Remove the indexes and recreate them afterwards

Similar to logging, indexes add overhead: every write the process performs must also update the index, which slows the load down if the indexes are left in place. So remove all the indexes and add them back after the loading process is completed.

Sort the file by primary key before moving the data to the data warehouse

Sorting the file greatly decreases the loading time and is one of the most often used methods. If you want indexes on other columns, you can drop them and recreate them later.

Balance the number of indexes

Balancing the number of indexes also matters on the source systems. Loading time increases when there are too many indexes, but there cannot be too few either, because user access time would increase.

Append to the existing table instead of creating a new one

Appending consumes less time than creating the table from scratch. There is one special case: when loading a very large number of records, it can be better to drop the table and create it again.

Parallelize the index builds and the loading of the table

Removing the indexes and loading large amounts of data in parallel greatly decreases the loading time. Most integration services provide this feature.

Limit the number of updates

Some data marts can be updated less than once a day; limiting the number of update runs can save a lot of time.
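Two of the tips above (dropping and recreating indexes, and appending rather than recreating the table) can be sketched with SQLite standing in for the warehouse database; the table and index names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_amount ON fact_sales (amount)")

# 1. Drop the index so each insert avoids index-maintenance overhead.
conn.execute("DROP INDEX ix_amount")

# 2. Append the (already transformed) rows in one bulk operation,
#    rather than recreating the table.
rows = [(i, float(i) * 10) for i in range(1000)]
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

# 3. Recreate the index once, after the load completes.
conn.execute("CREATE INDEX ix_amount ON fact_sales (amount)")
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

On a real warehouse engine the same pattern applies, only with the engine's own bulk-load utility in place of `executemany`.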

SQL Server Integration Services (SSIS)


Four important components:

Integration Services Service: monitors running packages and manages package storage.

Integration Services Object Model: manages communication between Integration Services and its users, giving access to the tools, command-line utilities and custom applications.

Integration Services Runtime: saves the layout of packages, runs packages, and provides support for logging, breakpoints, configuration, connections and transactions. The run-time executables are the package, containers and tasks.

Integration Services Data Flow: contains the data flow engine, which manages the movement of data from source to destination, manages the transformations that modify the data, and finally loads the data into the destination.

Create ETL Package

First of all, the basic requirement for creating an ETL package is to understand the format of the source data and the destination data. Once you understand those formats, you can start designing the transformations that map the source data to the destination data.

Look on the source data

First, determine what kind of data it is and whether it is stored in a flat file or a relational database, and whether it is structured or unstructured.

For example: SampleCurrencyData.txt

It has four columns: the average currency rate, the currency key, the date key and the end-of-day rate. After viewing the data in the source, you can determine what data types to use in the destination.

Look on the destination data

The next step is to design the table data types according to the source data information, for example as in the table below.

[Table: destination column design - Column Name, Data Type, Lookup Table, Lookup Column]

Mapping the source data to destination data

After completing the analysis of the source and destination data, the next question is how the mapping is to be done. In this example, the CurrencyKey and TimeKey values must be looked up elsewhere: they should be obtained from the alternate keys in the dimension tables through lookup transformations.
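The lookup step can be sketched outside SSIS as a plain dictionary lookup; the dimension contents and surrogate key values below are hypothetical:

```python
# Dimension tables: alternate (business) key -> surrogate key.
dim_currency = {"USD": 100, "EUR": 101}
dim_time = {"2011-03-15": 5001}

def lookup_keys(fact_row):
    """Resolve dimension surrogate keys for one extracted fact row."""
    return {
        "CurrencyKey": dim_currency[fact_row["currency"]],
        "TimeKey": dim_time[fact_row["date"]],
        "AverageRate": fact_row["average_rate"],
        "EndOfDayRate": fact_row["end_of_day_rate"],
    }

fact = lookup_keys({"currency": "EUR", "date": "2011-03-15",
                    "average_rate": 1.41, "end_of_day_rate": 1.42})
```

This is exactly what the Lookup transformations configured in step 6 below do: match an incoming column against a column in a reference dataset and emit the surrogate key.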

[Table: source-to-destination mapping - Flat File Column, Table Name, Column Name, Data Type]

Step 1:

Start SQL Server Business Intelligence Development Studio.

Select Integration Services Project and give the project a name.

Click Finish, and an empty package appears on the screen.

Step 2:

You have to add a connection manager depending on your source file. Since we are using a flat file as the example, add a Flat File connection manager. In the Flat File connection manager you can specify the file name and location, the file format and the column delimiters. To extract exactly the right file and no other, you must create one new Flat File connection manager for exactly one file format, so it will not pick up other flat files. Next, you should rename the columns to match the fact table column names, which will be useful when loading the data. Finally, you should map the data type for each column.

Right-click Connection Managers and choose New Flat File Connection…

The Flat File Connection Manager dialog box appears on screen; type in the connection manager name and browse to the location of the flat file.

In the Flat File Connection Manager dialog box, click Advanced.

Change the column names to match the fact table column names.

Finally, click OK.

Step 3:

The third step is to add an OLE DB connection manager so the package can connect to the destination data source and extract or load data into an OLE DB data source.

Right-click Connection Managers and choose New OLE DB Connection…

Click New when the dialog box appears.

Specify your server name, how to log on to the server, and which database to connect to.

Now you will have two connections.

Step 4:

The fourth step is to add a Data Flow task to the package. The Data Flow task encapsulates the data flow engine that moves data from the data source to the destination; the extraction, transformation and loading happen while the data moves.

Click the Control Flow tab.

Drag a Data Flow task from the toolbox onto the design surface.

After dragging it onto the design surface, click the Data Flow task, find its Name property and rename it to something meaningful. In this example, it should be "Extract Sample Currency Data".

The task now appears on screen labelled "Extract Sample Currency Data".

Step 5:

In this step, you add the data source to the package. The data source is a data flow component that uses metadata to specify the format and structure of the data to be extracted from the source. The extraction process extracts data according to the file format definition provided by the connection manager of the data source.

Once again, click the Data Flow tab.

Following the example, drag a Flat File source onto the design surface.

Click the newly added Flat File source, find its Name property and rename it to "Extract Sample Currency Data".

Double-click the Flat File source to open its editor dialog box.

In the connection manager box, select the source data.

Click Columns and verify that the names of the columns are correct.

Click OK.

Step 6:

Next, add Lookup transformations to obtain values by looking them up in tables in the destination data source, matching a specified column against a column in a reference dataset. If you have two or more lookup values, you need two or more Lookup transformations.

In the toolbox, drag Lookup to the design surface of Data Flow tab. Place Lookup below the data source.

Click the data source and pull its green arrow onto the Lookup transformation to connect them.

Click the Lookup transformation and change its Name property to something meaningful.

Double-click the Lookup transformation; a dialog box appears.

On the General page, select Full cache and in the Connection type area select OLE DB connection manager.

Next, on the Connection page, ensure the destination data source name is displayed.

Retrieve the lookup value by typing in the SQL query.

Then go to the Columns page and drag each field from the data source onto the matching field of the destination data source. Select the primary key.

If you have two Lookup transformations, drag another Lookup onto the surface and place it below the first one, then connect them by dragging the green arrow from the first Lookup to the second.

In the Input Output Selection dialog box, click Lookup Match Output in the Output list box, and then click OK.

Double-click the second Lookup; on the General page, select Partial cache.

On the Connection page, make sure the destination data source is displayed.

In the Use a table or view box, select the table you want to look up.

Then go to the Columns page and drag each field from the data source onto the matching field of the destination data source. Select the primary key.

On the Advanced page, review the caching options. Finally, click OK.

Step 7:

After step 6, your package can extract data and transform it into the specified format. This step loads the transformed data into the destination data source.

Drag an OLE DB destination from the toolbox onto the design surface on the Data Flow tab.

Drag the green arrow from the second Lookup to the OLE DB destination.

Click Lookup Match Output in the Input Output Selection dialog box, then click OK.

Rename the OLE DB destination's Name property to something meaningful.

Double-click the OLE DB destination and ensure the destination data source is selected.

Select the fact table in the Name of the table or view box.

Click Mappings and make sure all the input columns are mapped correctly to the destination columns; an unmapped column means the mapping is not complete. Finally, click OK.

Final Step:

You should now see the following diagram.

Before you run the package, it should contain the objects shown in the diagram above. To run the package, go to the Debug menu and click Start Debugging.


In conclusion, a big organization badly needs a data warehouse to manage its data. Because every system stores data in a different format, a data warehouse is needed to bring the data into a standard format, making it easy to produce reports and do analysis; the ETL process is central to this. With a few clicks and some drag and drop you can populate a data warehouse database, and the process of creating an ETL package is easier now than it used to be. Last but not least,

you can build the best data warehouse in the world, but if you can't load data into it, it is useless. Creating the ETL processes is the key to the success of your data warehouse: take your time to do it wisely and it will make your life easier. So anyone who wants to build a business intelligence system needs to take care, and take time, to create a good ETL process.