Extract Transform Load Processes Computer Science Essay


Business intelligence is a set of applications and knowledge used to gather and analyze data to assist users in making better decisions [1]. One important area covered by business intelligence is the ETL process.

ETL stands for Extraction-Transformation-Loading, a process used to extract data from different sources, cleanse it, customize it and insert it into a data warehouse [3]. The extraction process pulls data from external sources; the transformation process converts the data to fit operational needs, ensuring consistency and meeting business requirements; and the loading process loads the transformed data into the data warehouse or database [4]. ETL is commonly used in data warehousing, where the process extracts data from one or more external or internal sources, cleans and filters it, then loads it into the data warehouse so that users can analyze the data to support critical decision making.

There are many popular ETL tools, and this research focuses mainly on Microsoft SQL Server's ETL tooling. Microsoft SQL Server is a relational database server that includes a component called SQL Server Integration Services (SSIS). SSIS is primarily a platform for workflow applications and data integration, and it contains good data warehousing tools for the ETL process [5]. SSIS provides a set of tools that support the ETL process, such as the Data Flow Engine, the Scripting Environment and the Data Profiler.

Chapter 2 Extract-Transform-Load Processes

Extract-Transform-Load processes, known as ETL processes, form a multi-stage pipeline in data warehousing. The first stage is extraction, which pulls data from different databases and various data formats. The second stage is transformation, which reformats and cleanses the data to meet business or operational needs. Lastly, the data is loaded into a data warehouse, database or data mart for analysis. Besides extracting, transforming and loading data, an ETL tool can also be used to move data from one operational system to another. [6]

2.1 Extraction

Extraction is the first part of the ETL process; it extracts data from external or internal sources. These sources are often maintained on different, mutually incompatible operating systems and hardware.

2.1.1 Data Extraction Method

When starting an ETL process, the first thing to consider is the method of extraction, because it affects the transportation process, the source system and the refresh time for the warehouse. [7] There are two classes of data extraction methods: logical extraction and physical extraction.

Logical Extraction

Logical extraction is divided into two kinds: full extraction and incremental extraction. Full extraction extracts the data entirely from the sources; it does not need to keep track of changes to the data sources, because it replicates all of the latest data in the source system. An example of full extraction is an export file of a single table. Incremental extraction works quite differently: instead of extracting everything in one pass, it extracts data in successive runs. Incremental extraction must monitor the changes made to the sources, and at a specific point in time only the data that has changed since the last run is extracted. That point may be the end of the last extraction run, or a more complex business transaction or event such as the last booking day of a fiscal period. The change information for the data sources must be identifiable for the specific period. It can be provided by the source data itself, e.g. an application column that reflects the latest change timestamp, or by a change table maintained by a mechanism that keeps track of the changes. [7]
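The timestamp-based variant described above can be sketched as follows; a minimal illustration using SQLite, where the `orders` table and its `last_modified` column are hypothetical names, not taken from any particular source system:

```python
import sqlite3

# Hypothetical source table with a last-modified timestamp per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-05"), (3, 30.0, "2024-01-09")],
)

def incremental_extract(conn, high_water_mark):
    """Extract only the rows changed since the previous run."""
    rows = conn.execute(
        "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
        (high_water_mark,),
    ).fetchall()
    # The new high-water mark is the latest timestamp seen in this run.
    new_mark = max((r[2] for r in rows), default=high_water_mark)
    return rows, new_mark

# Only rows changed after 2024-01-03 are extracted on this run.
rows, mark = incremental_extract(conn, "2024-01-03")
```

A full extraction would simply omit the `WHERE` clause; the incremental version must persist `mark` between runs.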

Physical Extraction

Physical extraction is also divided into two types: online extraction and offline extraction. Online extraction extracts data directly from the source system; the process either connects directly to the source system to access its tables, or connects to an intermediate system that stores the data in a preconfigured manner, e.g. snapshot logs and change tables. Online extraction can extract the data or transaction information using either the original source objects or prepared source objects. In contrast, offline extraction does not extract data directly from the source system; the data is staged explicitly outside of the original source system. Such data either has an existing structure, e.g. archive logs and redo logs, or is created by an extraction routine. [7]

2.1.2 Data Extraction Design

When designing the data extraction process, there are a few factors to consider, because they affect project cost, understandability, and ease of maintenance and development. These factors are:

Data Consistency

If it is necessary to extract data from more than one source, then the data from these systems must describe a single business world, not different business worlds in different time warps. In addition, not every source system stamps each transaction with a date and time. The extraction sometimes needs to solve this problem by running after all batch and interface runs are complete, or by time-stamping all the extracted data itself, since dimensions are interlocked with the time dimension. [8]

Reliable source

The data to be extracted must be reliable, meaning it must come from a reliable and correct original source. For example, customer information may exist in several systems, so the data must be extracted from the source that is most up to date and complete. [8]

Timely Availability

The extraction process must be completed on time and in the correct format so that the next stage of the ETL process can continue on schedule. For example, if the transformation process needs to start at nine in the morning, then the extraction process should finish before that time. [8]

Extraction Quality

A quality extraction process must ensure complete synchronization between the extracted data in the staging area and in the production databases. This can be monitored by matching counts and aggregations of the important fields in both the staging and source systems. [8] Besides that, poor quality in the extracted data, such as missing values, values not in the appropriate format, and referential integrity issues, must be addressed, because the data loaded into the data warehouse must be accurate. For example, if the database is used for marketing, customer addresses must be validated first to avoid returned mail. [9]
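The count-and-aggregate match can be sketched like this; a minimal illustration with two in-memory SQLite databases standing in for the source and staging systems (the `sales` table and its columns are hypothetical):

```python
import sqlite3

# Two hypothetical stores: the production source and the ETL staging area.
source = sqlite3.connect(":memory:")
staging = sqlite3.connect(":memory:")
for db in (source, staging):
    db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 5.0), (2, 7.5)])
staging.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 5.0), (2, 7.5)])

def extraction_matches(source, staging):
    """Compare the row count and the aggregate of an important field
    in both systems; a mismatch signals an incomplete extraction."""
    src_count, src_sum = source.execute(
        "SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
    stg_count, stg_sum = staging.execute(
        "SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
    return src_count == stg_count and src_sum == stg_sum

ok = extraction_matches(source, staging)
```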

2.1.3 Extraction Sources

Sources in the ETL process differ, because they depend on the platform, the Database Management System and the language used to connect to the source, for example COBOL or Transact-SQL. These sources include:

Open Database Connectivity (ODBC)

Open Database Connectivity is an interface used to connect to Database Management Systems and flat files. The ODBC interface contains an ODBC manager that connects the ETL application to an ODBC driver. [10]

Flat Files

A flat file, commonly known as a text file, is a basic, non-relational data set. Flat files make it possible to extract data from a database that cannot otherwise be accessed. Flat files are very important because much data arrives in flat-file format, and handling a flat file is easier than handling a Database Management System. In addition, flat files allow bulk loads in either fixed-length or delimited format. [10]
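As a rough sketch of both flat-file styles mentioned above (the field names and column widths are invented for illustration):

```python
import csv
import io

# A delimited flat file, here as an in-memory string standing in for a real file.
delimited = "id,name,balance\n1,Alice,100.50\n2,Bob,75.00\n"

def read_delimited(text):
    """Parse a delimited flat file with the standard csv module."""
    return list(csv.DictReader(io.StringIO(text)))

def read_fixed_length(text, widths):
    """Parse fixed-length records by slicing each line at known column widths."""
    rows = []
    for line in text.splitlines():
        fields, pos = [], 0
        for width in widths:
            fields.append(line[pos:pos + width].strip())
            pos += width
        rows.append(fields)
    return rows

records = read_delimited(delimited)
# The same data as fixed-length records: 2-char id, 7-char name, 6-char balance.
fixed = read_fixed_length("1 Alice  100.50\n2 Bob     75.00", (2, 7, 6))
```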

Enterprise Resource Planning (ERP) system sources

Enterprise Resource Planning systems integrate various data into one database or platform. Such a system is quite complex, because it normally contains hundreds of tables; therefore, extracting from an ERP system is quite difficult. The best solution is to use special adaptors to communicate with these systems, but such adaptors are usually costly. Examples of ERP vendors are Oracle and SAP. [10]


Extensible Markup Language (XML)

XML stands for Extensible Markup Language, a data language that is independent of platform. Data from different platform-dependent languages can be exchanged by using XML. XML is designed to describe both the content and the structure of data. [10]
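A small sketch of how structured content can be pulled out of an XML source (the `customers` element layout is hypothetical):

```python
import xml.etree.ElementTree as ET

# A tiny platform-independent XML document describing customer records.
doc = """
<customers>
  <customer id="1"><name>Alice</name><country>MY</country></customer>
  <customer id="2"><name>Bob</name><country>SG</country></customer>
</customers>
"""

root = ET.fromstring(doc)
# XML carries both structure (elements, attributes) and content (text).
records = [
    {"id": c.get("id"), "name": c.findtext("name"), "country": c.findtext("country")}
    for c in root.findall("customer")
]
```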

Web Logs and Click Stream Data

A web log records every user that enters a particular website, and a click stream records and stores the actions performed by the computer user. This source is important because it stores information about the preferences and needs of users. [10]


2.2 Transformation

Transformation is the second stage of the ETL process; it transforms the extracted data into a consistent data warehouse schema by following the business rules and requirements. Data transformation is the most complicated part of the ETL process, because it must ensure data accuracy and validity, convert data types and apply business rules. [11] Data transformation covers a few areas, which are:


Cleansing

Cleansing changes data that violates the business rules so that it conforms to those rules. This is done by ETL programs that determine the right and correct data values. [12]


Summarization

Data values are summarized to obtain total figures, which are stored at multiple levels as business facts in multidimensional fact tables. [12]


Derivation

New data is created from the extracted source data by using calculations, table lookups or program logic. For example: calculating a customer's age from the current year and their date of birth. [12]


Aggregation

Data elements for customers can be aggregated from various source files and databases, e.g. a customer file and a sales file. [12]


Integration

The main aim of integration is that each data element ends up with one standard definition and one approved name. Integration forces the reconciliation of the different names and values that the extracted sources use for the same data element. Each data element must be associated with its source databases as well as with the business intelligence target databases. Data standardization is one of the business objectives. [12]

Sometimes the transformation process can seem endless, because a great deal of data may need to be transformed to enforce data integrity and the business requirements or rules. [12]
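A few of the areas above, derivation, summarization and aggregation, can be sketched together. The customer rows and field names below are invented, and the age calculation assumes a fixed reference date so the result is reproducible:

```python
from datetime import date

# Hypothetical extracted rows; date_of_birth arrives as a string from the source.
rows = [
    {"customer": "Alice", "date_of_birth": "1990-06-15", "sales": 120.0},
    {"customer": "Bob",   "date_of_birth": "1985-01-02", "sales": 80.0},
    {"customer": "Alice", "date_of_birth": "1990-06-15", "sales": 30.0},
]

def derive_age(dob, today=date(2024, 1, 1)):
    """Derivation: compute age from the date of birth and a reference date."""
    born = date.fromisoformat(dob)
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

# Summarization/aggregation: total sales per customer across all source rows.
totals = {}
for r in rows:
    totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["sales"]

ages = {r["customer"]: derive_age(r["date_of_birth"]) for r in rows}
```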

2.2.1 Data Cleansing

Data cleansing, also known as data scrubbing, is the process of detecting, correcting or removing records found to be inaccurate or corrupt in the extracted data. Data that is incorrect, inaccurate, irrelevant or incomplete must be replaced, modified or deleted, because such dirty data will mislead or reduce the accuracy of analysis in the data warehouse or database. [13] Data cleansing is very important: if the data in the database or data warehouse is inaccurate, it may cause serious problems, such as wrong decisions being made on unreliable data. Data cleansing can be performed on a single data set or on multiple sets of data. There are two ways to perform data cleansing, chosen according to the complexity of the data: manual data cleansing is used in the simplest cases, while automated data cleansing is used for complex operations. [14]

Manual Data Cleansing

Manual data cleansing is done by a person who reads the extracted data and verifies its accuracy. They correct data that contains spelling errors and complete missing entries. Unnecessary data is removed in this process to increase the efficiency of data processing. [14]

Automated Data Cleansing

In some cases manual data cleansing is not efficient, e.g. when the number of records is massive or the operation is complex and must be completed within a specific period. In such operations the human work is replaced by computer programs. [14]

Data cleansing is a complex and time-consuming process that requires a fair amount of work; it is also a critical element of the ETL process, because inaccurate data in the data warehouse can lead to business failure. [14] The data cleansing process is divided into a few stages, which are:

Data Auditing

The extracted data is audited using statistical methods to detect data anomalies and contradictions. The result indicates the characteristics of the anomalies and where the anomalous data is located. [13]

Workflow Specification

Workflow specification comes after data auditing; it considers and specifies how to handle the anomalies found in the earlier stage. For example, if there are typos from the data input stage, the keyboard layout can be used to work out likely corrections. [13]

Workflow Execution

After the correctness of the specification is verified, the workflow is executed; it should run efficiently even on huge data sets. [13]

Post-Processing and Controlling

After the cleansing workflow has executed, the result must be checked to verify its correctness. Manual checking and correction is needed for any errors that the workflow did not fix. [14]

During the data cleansing operation there are some popular methods in common use, such as parsing, duplicate elimination and statistical methods.


Parsing

The parsing method detects syntax errors in the data. A parser decides whether a data value is acceptable within the data specification. [13]
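A minimal sketch of the parsing idea, assuming the specification requires ISO-style dates (the pattern is illustrative, not a complete date validator):

```python
import re

# Specification: the field must be an ISO-style date, YYYY-MM-DD.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def within_specification(value):
    """Parsing: decide whether a value is acceptable within the specification."""
    return bool(DATE_PATTERN.match(value))

good = within_specification("2024-03-01")
bad = within_specification("01/03/2024")  # wrong syntax, rejected by the parser
```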

Duplication Elimination

Using an algorithm, data that contains duplicated information or entities is detected. To make the detection faster, the data is sorted by a key that brings duplicated entries closer together. [13]
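The sort-then-scan approach can be sketched like this (the records and the key name are invented):

```python
# Sort by a key so duplicate entries land next to each other, then scan once.
records = [
    {"id": 3, "name": "Bob"},
    {"id": 1, "name": "Alice"},
    {"id": 3, "name": "Bob"},
    {"id": 2, "name": "Carol"},
]

def eliminate_duplicates(records, key):
    """Duplicate elimination: sort on the key, then drop adjacent repeats."""
    ordered = sorted(records, key=lambda r: r[key])
    unique = []
    for rec in ordered:
        if not unique or unique[-1][key] != rec[key]:
            unique.append(rec)
    return unique

deduped = eliminate_duplicates(records, "id")
```

Sorting makes the scan a single pass, which matters when the data set is too large for an in-memory hash of all keys.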

Statistical Method

Using the standard deviation, the range or a clustering algorithm, data with unexpected values can be found, and such data is usually erroneous. Statistical methods can also be used to replace missing values in the data with one or more plausible values obtained from an extensive data augmentation algorithm. [13]
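The standard-deviation variant can be sketched as follows; the sample values and the two-standard-deviations threshold are illustrative choices:

```python
import statistics

# Measurements with one suspicious entry well outside the usual range.
values = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]

def find_outliers(values, threshold=2.0):
    """Statistical method: flag values more than `threshold` standard
    deviations from the mean as likely errors."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

suspects = find_outliers(values)
```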

The output or result of the data transformation should achieve a set of quality criteria, which include:


Accuracy

Data must be accurate, which can be measured as an aggregate value over consistency and integrity. [13]


Integrity

Data integrity is the most crucial of the quality criteria; it can be measured as an aggregate value over completeness and validity. [13]


Completeness

Data must be complete; anomalies such as typos or empty values must be corrected. [13]


Uniqueness

Data should be unique, without duplicated entries, because duplicates affect the accuracy of analysis. [13]


Validity

Data validity can be checked as the amount of data that satisfies the integrity constraints. [13]

2.2.2 Other Types of Transformation Done on the Data

Integration is performed on the data during transformation; it links data from numerous sources into a well-linked whole. Data integration has some key elements, which include:

Creating Common Keys

Since extraction pulls data from sources running on different systems, different systems may use different keys to represent the same entity. Creating a common key solves this problem and avoids confusion. For example, the production system could use an agent code different from the sales system's, and the same may apply to customer and other codes. [15]

Creating surrogate keys

Surrogate keys are created for use in the loaded data and the star schemas. Creating surrogate keys is laborious but straightforward. At a high level: take the dimension against which the key is needed, list the maximum possible instances, allocate a series of surrogate keys (numbers, or combinations of numbers and characters) for the dimension, and finally map the dimension key to the surrogate key range. [15]
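The high-level steps above can be sketched as a simple generator; `SurrogateKeyGenerator` and the natural keys below are hypothetical names invented for illustration:

```python
import itertools

class SurrogateKeyGenerator:
    """Map natural dimension keys (which differ between source systems)
    onto a generated surrogate key range."""
    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._mapping = {}

    def key_for(self, natural_key):
        # Assign each distinct natural key the next number in the range.
        if natural_key not in self._mapping:
            self._mapping[natural_key] = next(self._counter)
        return self._mapping[natural_key]

gen = SurrogateKeyGenerator()
k1 = gen.key_for(("sales_system", "CUST-042"))
k2 = gen.key_for(("production_system", "AG9913"))
k3 = gen.key_for(("sales_system", "CUST-042"))  # same entity, same surrogate key
```

Qualifying the natural key with its source system keeps keys from different systems from colliding before the common-key mapping is applied.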


Standardizing Attributes

This transformation focuses on standardizing descriptions and textual attributes. Different systems may describe the same data or entities differently; this process standardizes the differing attributes into one descriptive attribute for all masters and codes. [15]

Data Type Conversion

Some fields in the data source arrive in an inappropriate format; for example, a credit card field should be numeric, but the data is in character form. Some data type conversion may already happen during extraction, because the target system and the source use different database models. [15]
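A minimal sketch of such a conversion, using the credit card example (the separator handling is an assumption about how the source formats the field):

```python
def to_numeric(field):
    """Data type conversion: strip formatting characters from a
    character-form numeric field, then cast it to the target type."""
    cleaned = field.replace(" ", "").replace("-", "")
    if not cleaned.isdigit():
        raise ValueError(f"not a numeric field: {field!r}")
    return int(cleaned)

# Character data from the source, numeric value for the target.
converted = to_numeric("4111-1111-1111-1111")
```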

Create Derived Attributes

From the original data source, derived attributes can be created from a field by applying transformation rules. For example, a date field can be used to derive the week of the year, month of the year, quarter of the year and so on.


De-normalization

Some data warehouse models require the data model to change to meet their requirements, so the data needs to be de-normalized, or occasionally normalized. In most cases, de-normalizing a dimension model is more common than normalizing one. For example, a table may have separate fields such as city, zone, state and so on; de-normalization can join these fields into one field called address. [15]
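The city/zone/state example can be sketched as follows (the field names and sample row are invented):

```python
# De-normalization: collapse separate location fields into one address field.
row = {"city": "Kuala Lumpur", "zone": "Bukit Bintang", "state": "WP", "zip": "55100"}

def denormalize_address(row, parts=("city", "zone", "state", "zip")):
    """Join the separate location fields into a single address column."""
    return ", ".join(row[p] for p in parts)

address = denormalize_address(row)
```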


Normalization

Normalization is needed less often than de-normalization. It may still be required in some situations, because heavily restructured tables can lead to slow response times when queries involve a lot of joins. [15]

Data Relevance

The data loaded into the data warehouse should be relevant, so any entity or element that is not required for analysis should be eliminated. Some irrelevant data is filtered out during extraction, but detailed data may still remain in the data source. The transformation therefore removes unnecessary fields or entire entities, and truncates or eliminates codes. [15]

2.3 Loading

Loading is the last stage of the ETL process; it loads the cleansed and filtered extracted data into the end target, which is normally a data warehouse. The loading process differs based on the requirements of the organization: some warehouses overwrite existing data with cumulative data; some update the extracted data daily, weekly or monthly; and some load new data in historicized form, e.g. hourly. [4]

To achieve effective, faster loading, several methods can be used.

Turn off Logging

During the loading period, logging should be turned off, because if logging is on, the system creates a transaction log entry for every write to the database. Since loading data into a data warehouse involves a huge amount of writing, turning logging off avoids this overhead. [16]

Drop and recreate indexes

Similar to logging, indexing in the database also causes overhead. To avoid it, the indexes should be dropped before loading the tables and recreated after loading completes. This performs the indexing in one batch rather than as a separate activity every time a new record enters the system. This method only reduces overhead if a large amount of data is being loaded; for small loads, recreating the index costs more time than maintaining it for each new record. [16]
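The drop-then-recreate pattern can be sketched with SQLite; the `fact_sales` table and index names are hypothetical, and real warehouse platforms have their own DDL for this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_amount ON fact_sales (amount)")

# Drop the index before the bulk load so each insert avoids index maintenance.
conn.execute("DROP INDEX idx_amount")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [(i, float(i)) for i in range(1000)],
)
# Recreate the index once, in batch, after the load completes.
conn.execute("CREATE INDEX idx_amount ON fact_sales (amount)")

count, = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()
```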

Pre-sort the file

The file can be pre-sorted on the primary key index for data warehouse loading, which greatly speeds up the indexing process. Among all the methods, this is the most recommended technique. [16]

Use Append

Instead of doing a full refresh, it is better to append to the updated table, because adding new records is faster than rebuilding the whole table. But if a large number of new records must be added, it is better to drop and recreate the table. [16]

Parallelize table load

Bulk loading in parallel, with indices dropped, shortens the loading time; most data warehouse platforms can perform this. [16]

Manage Partition

Partitions should be managed when loading data; partitioning divides a table into many smaller parts for easier administration and improves query performance on a large fact table. Partitioning the table by date, such as year, month or quarter, is the most recommended approach. [17]

Allow incremental loading

Incremental loading should be enabled because it keeps the database synchronized with the source system. [17]

Chapter 3 SSIS

3.1 SQL Server 2008 Architecture

[Figure: Component Interfaces with Integration Services (SQL Server Integration Services, 2009) [18]]

SQL Server 2008 provides many useful features, including the database engine, analysis services for multidimensional data, analysis services for data mining, integration services, replication, reporting services and the SQL Server Service Broker. The function of each feature is explained below:

Database Engine

The database engine is the core service for storing, securing and processing data. It provides controlled access and rapid transaction processing to meet the requirements of data-consuming applications, and it also provides support for sustaining high availability. [19]

Analysis Services for Multidimensional Data

This feature supports Online Analytical Processing (OLAP), which can be used to create, design and manage multidimensional structures containing data from other data sources. [19]

Analysis Services for Data Mining

This feature is for creating, designing and visualizing data mining models. These models can be built from other data sources by using industry-standard data mining algorithms. [19]

Integration Services

Integration Services is the platform for data integration solutions. Its main process is extract, transform and load, which is mainly used for data warehousing. [19]


Replication

Replication is the technology for copying and distributing data or database objects from one database to another and synchronizing between the databases to maintain consistency. [19]

Reporting Services

Reporting Services is web-enabled, enterprise reporting functionality that can generate reports from various data sources, and the reports can be produced in various formats. [19]

Service Broker

Service Broker is used to build secure and scalable database applications. It provides a message-based communication platform that allows independent application components to perform as a functioning whole. [19]