Concepts And Technology Of Data Extraction Transformation Loading Computer Science Essay



Extraction-Transformation-Loading (ETL) is the process of moving data from various sources into a data warehouse. In this research we analyze the concept of ETL, using Microsoft SSIS (SQL Server Integration Services) as the basis of the study. The specific steps are explained in turn: (a) extracting data from one or more external data sources; (b) transforming data to ensure consistency and satisfy business requirements; and (c) loading data into the resultant data warehouse. The research also includes an in-depth analysis of the Microsoft SSIS tools that support the ETL process, namely: (a) the Data Flow Engine, (b) the Scripting Environment and (c) the Data Profiler.

Key Words: ETL process, Microsoft SQL Server Integration, SSIS.

1. Introduction

ETL is the most important process in a Business Intelligence (BI) project [1]. When international companies such as Toyota want to reallocate resources, they must do so wisely. Consolidating data into useful information from multiple regions such as Japan, the US and the UK is difficult for many reasons, including overlapping and inconsistent relationships among the regional companies. For example, the method of storing a name differs between the companies: in Japan it is stored as T.Yoon Wah, in the US as Yoon Wah Thoo, and in the UK as YW.Thoo. When such data is combined to generate useful information, this may lead to inconsistent data. To solve the problem, a star-schema/snowflake-schema data warehouse takes the data from many transactional systems and copies it into a common format, using a relational database design that is completely different from that of a transactional system and contains many star-schema configurations [7]. Performing the tasks associated with moving, correcting and transforming the data from the transactional systems into the star-schema data warehouse is called Extraction, Transformation and Loading (ETL). ETL migrates data from relational databases into the data warehouse and converts the various formats and types into one consistent system. It is commonly used for data warehousing, where regular updates from one or more systems are merged and refined so that analysis can be done using more specialized tools. Typically the same process is run over and over as new data appears in the source application [2]. The ETL process consists of the following steps [3]: 1. Import data from various data sources into the staging area. 2. Cleanse the data of inconsistencies (either an automated or a manual effort). 3. Ensure that the row counts of imported data in the staging area match the counts in the original data source. 4. Load data from the staging area into the dimensional model.
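The four steps above can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not an SSIS API: the source rows, column names and the tiny dimensional model are all invented for the example, and a real pipeline would read from the regional databases instead of an in-memory list.

```python
# Minimal sketch of the four ETL steps: stage, cleanse, validate counts, load.
SOURCE = [
    {"id": 1, "country": "jp", "amount": "100"},
    {"id": 2, "country": "US", "amount": "250"},
    {"id": 2, "country": "US", "amount": "250"},   # duplicate from a re-sent feed
    {"id": 3, "country": "uk", "amount": "75"},
]

def extract(source):
    # Step 1: copy the raw rows into a staging area, untouched.
    return list(source)

def validate(staging, source):
    # Step 3: row counts in staging must match the original source.
    assert len(staging) == len(source), "row counts diverged during import"

def cleanse(staging):
    # Step 2: drop duplicates and standardize the country-code format.
    seen, clean = set(), []
    for row in staging:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        clean.append({"id": row["id"],
                      "country": row["country"].upper(),
                      "amount": int(row["amount"])})
    return clean

def load(clean):
    # Step 4: load into a (very small) dimensional model.
    dim_country = sorted({row["country"] for row in clean})
    fact_sales = [(row["id"], dim_country.index(row["country"]), row["amount"])
                  for row in clean]
    return dim_country, fact_sales

staging = extract(SOURCE)
validate(staging, SOURCE)
dim_country, fact_sales = load(cleanse(staging))
```

The duplicated id 2 and the mixed-case country codes stand in for the kind of overlap and inconsistency described above; after cleansing, the fact table references the dimension by position rather than by raw value.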

2. In-depth research on ETL

In Fig. 1, we abstractly describe the general framework for ETL processes. In the bottom layer we depict the data stores that are involved in the overall process. On the left side, we can observe the original data providers (typically, relational databases and files). The data from these sources are extracted (as shown in the upper left part of Fig. 1) by extraction routines, which provide either complete snapshots or differentials of the data sources. Then, these data are propagated to the Data Staging Area (DSA) where they are transformed and cleaned before being loaded to the data warehouse. The data warehouse is depicted in the right part of Fig. 1 and comprises the target data stores, i.e., fact tables and dimension tables. [4]

2.1 Extraction

The extraction part gathers data from several sources and performs analysis and cleansing on it. The analysis stage reads raw data that was written directly to disk, to flat files, or to relational tables of a structured system. Data can be read multiple times if needed in order to achieve consistency. Data cleansing is also done in the extraction part: the process eliminates duplicate or fragmented data and excludes unwanted or unneeded information. The next step then moves on to the transformation part. In Microsoft SSIS, we can use the Data Flow components called Integration Services sources to retrieve data in several formats through a connection manager. The supported source formats are varied, including OLE DB, Flat File, ADO.NET source, Raw File source and others [11].
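As a hedged sketch of what such an extraction routine does, the following Python reads a flat-file source, drops duplicate and fragmented records, and hands clean rows onward. The semicolon-separated `id;name` layout is an assumption chosen for the example; in SSIS this job falls to a Flat File source plus its connection manager.

```python
# Extraction sketch: read a delimited flat file, excluding unusable rows.
import csv
import io

# Stand-in for a file on disk; a real extraction would open an actual file.
FLAT_FILE = io.StringIO("id;name\n1;Alice\n2;Bob\n2;Bob\n3;\n")

def extract_flat_file(handle):
    rows, seen = [], set()
    for record in csv.DictReader(handle, delimiter=";"):
        if not record["name"]:      # exclude fragmented/incomplete rows
            continue
        if record["id"] in seen:    # eliminate duplicate rows
            continue
        seen.add(record["id"])
        rows.append(record)
    return rows

rows = extract_flat_file(FLAT_FILE)
```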

2.2 Transformation

The transformation step is arguably the most complex part of the ETL process, because much of the data processing happens during this step. Its purpose is to prepare the data to be stored in the data warehouse. Converting the data, such as changing data types and lengths, combining data, verifying it and standardizing it, is done in the transformation part. SSIS provides plenty of transformation tools to help developers achieve their target. The transformations in SSIS are categorized to help designers develop their projects: Business Intelligence, Row, Rowset, Split and Join, Auditing, and Custom transformations. Two that are commonly used in an ETL process are the Data Conversion transformation, which converts the data type of a column to a different data type, and the Conditional Split transformation, which routes data rows to different outputs. More transformation examples can be found in the SQL Server documentation on MSDN [10].
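A data-conversion transformation of this kind can be illustrated in plain Python. This is an analogue, not SSIS itself; the column names and the 30-character length cap are assumptions made for the example.

```python
# Transformation sketch: change column types/lengths and standardize values.
def convert(row):
    return {
        "id": int(row["id"]),                      # string -> integer
        "name": row["name"].strip().title()[:30],  # trim, fix case, cap length
        "hired": row["hired"].replace("/", "-"),   # standardize date separator
    }

converted = [convert(r) for r in [
    {"id": "7", "name": "  yoon wah thoo ", "hired": "2004/07/01"},
]]
```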

2.3 Loading

The loading step is the final step of the ETL process; it stores the generated data in the data warehouse. The loading step can follow a star schema [5] or a snowflake schema [6] in order to achieve data consolidation [7]. In SSIS this is implemented with Integration Services destinations, which, similar to the Integration Services sources, use a connection manager to choose one or more data destinations for the output. [12]
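The load against a star schema typically resolves each incoming natural key to a surrogate key in a dimension table and then inserts a fact row referencing it. The following sketch assumes invented table and column names (`dim_product`, `fact_orders`); it only illustrates the key-lookup idea, not any particular warehouse API.

```python
# Loading sketch: surrogate-key lookup into a dimension, then fact insert.
dim_product = {}   # natural key -> surrogate key
fact_orders = []

def load_row(product_code, qty):
    if product_code not in dim_product:
        dim_product[product_code] = len(dim_product) + 1  # assign new key
    fact_orders.append({"product_key": dim_product[product_code], "qty": qty})

for code, qty in [("A100", 2), ("B200", 1), ("A100", 5)]:
    load_row(code, qty)
```

Note that the third row reuses the surrogate key created for the first, which is exactly what keeps the fact table consistent when the same product arrives from several source systems.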

3. Microsoft SQL Server Integration Services

ETL tools are created for developers to plan, configure and manage the ETL process. With the tools developed by Microsoft, developers now have the ability to more easily automate the importing and transformation of data from many systems across the enterprise. Microsoft SQL Server 2005 includes a tool that helps automate the ETL process, called SQL Server Integration Services (SSIS). This tool is designed to deal with common issues in the ETL process. We build up this research paper from the ground up by studying the ETL tool built by Microsoft, namely SSIS.

3.1 SSIS Architecture

Fig. 2 shows an overview of the SSIS architecture. SSIS is a component of SQL Server 2005/2008; it can implement an ETL process from scratch and automate it alongside many supporting components such as the database engine, Reporting Services, Analysis Services and others. SSIS segregates the Data Flow Engine from the Control Flow Engine (also called the SSIS Runtime Engine), a design intended to achieve a high degree of parallelism and improve overall performance. Figure 2: Overview of SSIS architecture.

SSIS consists of two main components, listed below:

SSIS Runtime Engine - The SSIS runtime engine manages the overall control flow of a package. It stores the layout of packages, runs packages and provides support for breakpoints, logging, configuration, connections and transactions. The run-time engine is a parallel control flow engine that coordinates the execution of tasks or units of work within SSIS and manages the engine threads that carry out those tasks. The SSIS runtime engine performs the tasks inside a package in the traditional sequential manner. When the runtime engine meets a data flow task in a package during execution, it creates a data flow pipeline and lets that data flow task run in the pipeline. [9]

SSIS Data Flow Engine - The SSIS Data Flow Engine handles the flow of data from data sources, through transformations, to destinations. When a Data Flow task executes, the SSIS data flow engine extracts data from data sources, runs any necessary transformations on the extracted data and then delivers the data to one or more destinations.

The architecture of the data flow engine is buffer oriented: the engine pulls data from the source, stores it in memory and performs the transformations in the buffer itself, rather than processing on a row-by-row basis. The benefit of this in-buffer processing is that it is much quicker, since it is not necessary to copy the data physically at every step of the data integration; the data flow engine processes data as it is transferred from source to destination. [9] ETL work is carried out in the Data Flow task shown in Fig. 2: extract data from several sources, transform and manipulate the data, and load it into one or more destinations.
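The buffer-oriented idea can be made concrete with a short sketch: rows are pulled into fixed-size buffers, each buffer is transformed in place, and the same memory flows on to the destination without per-row copies. The buffer size of 3 and the doubling "transformation" are arbitrary choices for illustration.

```python
# Buffer-oriented processing sketch: transform whole buffers in place.
def buffered(source, size=3):
    buf = []
    for row in source:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:            # final, partially filled buffer
        yield buf

def transform_in_place(buffer):
    for row in buffer:
        row["value"] *= 2   # the transformation mutates the buffer itself

destination = []
for buf in buffered([{"value": v} for v in range(5)]):
    transform_in_place(buf)
    destination.extend(buf)  # no extra copy between pipeline steps
```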

3.1.1 Data Flow Engine

Regarding the SSIS Data Flow Engine mentioned previously, we now discuss how it relates to the ETL process through its data flow elements. SSIS provides three different types of data flow elements: sources, transformations, and destinations.

In a data flow, sources normally extract data from several kinds of stores, such as flat files, OLE DB sources and raw data files; this corresponds to the Extraction step of the ETL process. Transformations are typically used to edit, convert, join and clean data. Destinations load the generated data into data stores. These three element types match exactly the Extraction-Transformation-Loading steps of ETL.

In addition, SSIS has paths that connect the output of one element to the input of another. Paths define the sequence of components and allow the user to add labels to the data flow or examine the source of a column.

Figure 3: Data Flow Elements

Figure 3 shows that the data flow consists of a source, a transformation with an input and an output, and lastly the destination. Figure 3 also includes extra inputs and outputs for the external columns.

In SSIS, a source is a data flow element that makes data from several different external data sources available to the data flow. In a data flow, a source normally has one regular output. The regular output has output columns, which are the columns the source adds to the data flow.

Transformations: the possibilities for transformations are nearly infinite and vary widely. Transformations can execute tasks such as editing, joining, cleaning, combining and converting data.

In a transformation, inputs and outputs define the columns of incoming and outgoing data. Depending on the operation performed on the data, some transformations have a single input and several outputs, while other transformations have several inputs and a single output. Transformations can also include error outputs, which provide information about the error that occurred together with the data that failed: for instance, character data that could not be converted to a date data type.
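The error-output behaviour, including the date-conversion failure mentioned above, can be sketched as follows. This is a Python analogue of the concept, with an invented `hired` column: rows that convert cleanly go to the regular output, failures are routed to the error output with a description.

```python
# Sketch of a transformation with a regular output and an error output.
from datetime import datetime

def convert_dates(rows):
    output, error_output = [], []
    for row in rows:
        try:
            row["hired"] = datetime.strptime(row["hired"], "%Y-%m-%d").date()
            output.append(row)
        except ValueError as exc:
            # failed row travels to the error output with the error detail
            error_output.append({"row": row, "error": str(exc)})
    return output, error_output

ok, failed = convert_dates([{"hired": "2008-01-15"}, {"hired": "not a date"}])
```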

Lastly, destinations are the data flow elements that load the result into a specific destination, such as a flat file, or create an in-memory dataset.

An SSIS destination requires at least one input. The input contains input columns, which come from another data flow element. The input columns are mapped to columns in the destination. [17]

3.1.1.1 Example of a Data Flow Task

Here we present an example of creating a simple data flow task, i.e. an ETL process, using OLE DB. First, drag the Data Flow task from the toolbox into the Control Flow.

Then double-click the Data Flow task to open the Data Flow design view, and drag an OLE DB Source from the toolbox.

Double-click the OLE DB Source and create a new connection. Here we select the "Human Resource" table and test the connection.

Next, in the OLE DB Source Editor, set the data access mode to SQL command, because we are going to enter a Transact-SQL statement. Write

"SELECT * FROM HumanResource.Employee WHERE BirthDate < '1980' "

into the SQL command field, press Parse Query, then press the OK button.

Next, we use a Conditional Split from the Data Flow transformations: drag it over and connect the output path from the OLE DB Source component to the Conditional Split component. Double-click the Conditional Split component to enter its editor. Here we drag Gender from the upper-left column into the lower-right "Condition" column. Many functions are available, such as mathematical functions, string functions, NULL functions and others. Here we use the equality operator to continue our example. The condition will be

[Gender] == "F"; then press the OK button.

Next, drag a Flat File Destination [12] from the destination components into the data flow design view and connect it to the success output of the Conditional Split.

Then browse to the file name to be used for the output. You may create a new .txt file on the spot after you click the "Browse…" button. After selecting the loading destination, click the OK button.

Next, press F5 to execute the package. The three components turn green after processing the data, indicating a successful run. Check the output file for the generated result.

That is it; we have just run through a simple ETL process.
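The same pipeline can be sketched in plain Python to show what the package computes: select employees born before 1980, conditionally split on Gender == "F", and write the matching rows to a flat-file string. The employee rows here are invented stand-ins for HumanResource.Employee, not real AdventureWorks data.

```python
# Plain-Python analogue of the walkthrough: select, split, write flat file.
employees = [
    {"name": "Ana",  "gender": "F", "birth": 1975},
    {"name": "Ben",  "gender": "M", "birth": 1972},
    {"name": "Cara", "gender": "F", "birth": 1985},
]

selected = [e for e in employees if e["birth"] < 1980]  # the SQL command
matches = [e for e in selected if e["gender"] == "F"]   # the conditional split
flat_file = "\n".join(e["name"] for e in matches)       # flat file destination
```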

3.1.2 Scripting Environment

If the built-in tasks and transformations do not meet the developer's needs, the SSIS Script Task or Script Component can be used to code the functions the developer wants to perform.

By clicking the "Design Script…" button in the Script Task Editor, Visual Studio for Applications opens so the function can be coded. [19]

There is an improvement in the scripting environment between SSIS 2005 and 2008. In SSIS 2005, double-clicking the Script Task brings up the Script Task Editor, and the only scripting language is Microsoft Visual Basic .NET; in SSIS 2008, you can choose either C# or Visual Basic .NET.

Figure: Visual Studio for Applications (VSA)

A Script task is usually used for the following purposes:

Achieving a desired task by using technologies that are not supported by the built-in connection types.

Generating a task-specific performance counter. For instance, a script can create a performance counter that is updated while a complex or poorly performing task runs.

Determining whether specified files are empty or how many rows they contain, and then, based on that information, affecting the control flow in a package. For example, if a file contains zero rows, the value of a variable is set to 0, and a precedence constraint that evaluates the value prevents a File System task from copying the file. [20]
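The last purpose in the list above can be sketched as follows. This is a conceptual Python analogue of a Script Task plus a precedence constraint, not the SSIS object model: the file contents are simulated in memory, and `RowCount` is an assumed variable name.

```python
# Sketch: a "script task" counts rows, a "precedence constraint" gates a copy.
def script_task(lines):
    # The Script Task sets a package variable to the file's row count.
    return {"RowCount": sum(1 for line in lines if line.strip())}

def precedence_constraint(variables):
    # The copy step only runs when the constraint RowCount > 0 holds.
    return variables["RowCount"] > 0

empty_file, full_file = [], ["row1\n", "row2\n"]
should_copy_empty = precedence_constraint(script_task(empty_file))
should_copy_full = precedence_constraint(script_task(full_file))
```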

3.1.3 Data Profiler

The purpose of data profiling is to help assess data quality. A data profile is a collection of aggregate statistics about data, which may include the number of rows in the Customer table, the number of distinct values in the Street column, the number of null or missing values in the Name column, the distribution of values in the Country column, or the strength of the functional dependency of the Street column on the Name column (that is, the street should always be the same for a given name value). [16]

The Data Profiling task in SQL Server 2008 SSIS provides data profiling functionality inside the ETL process. By using the Data Profiling task, analysis of source data can be performed more efficiently, giving a better understanding of the source data and helping to avoid data quality problems before loading into the data warehouse. The result of the analysis is in XML format, which can be viewed using the Data Profile Viewer. [13]

Example of a Data Profiling Task

Using the AdventureWorks database:

After dragging the Data Profiling task into the Control Flow, double-click it to enter the properties window and do the configuration. The Data Profiling task requires a connection manager in order to work. In the properties menu, the user chooses the destination type: a file destination or a variable. A faster way to build a profile is the Quick Profile feature:

Figure 4: Single Task Quick Profile Form

The Data Profiling task can generate eight different data profiles. Five of these profiles analyze individual columns, and the remaining three analyze several columns or relationships between columns and tables; for more details about each profile, refer to MSDN. [16] A few examples are given below to explain data profiling further:

Figure 5: Editing the Data Profiling Task

After mapping the destination and the other properties, run the package.

Figure 6: Data Profiling Task Successfully Executed

The task executed successfully (green); the Data Profile Viewer is now needed to view the result. The Data Profile Viewer is a stand-alone tool used to view and analyze the results of profiling. It uses multiple panes to display the profiles requested and the computed results, with optional details and drill-down capability. [16]

Column Value Distribution Profile - Obtain the distinct values in a column and how often each occurs.

Figure 7: Result of Column Value Distribution Profile.

Column Null Ratio Profile - Obtain the ratio of null values in a column.

Figure 8: Result of Column Null Ratio Profile.

Column Statistics Profile - Obtain the minimum, maximum, mean and standard deviation of a column.

Figure 9: Result of Column Statistic Profile.

Column Pattern Profile - Obtain the patterns of the values in a column.

Figure 10: Result of Column Pattern Profile.
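To make concrete what the four profiles above report, the following sketch computes each of them by hand over a small made-up column. The pattern profile here is a crude approximation (letters become `A`, digits become `9`); the real Data Profiling task derives regular expressions.

```python
# Hand-computed analogues of the four column profiles shown above.
from collections import Counter
from statistics import mean, pstdev
import re

column = ["AB12", "AB34", None, "CD56", "AB12"]   # invented sample column
numbers = [12, 34, 56, 12]                        # invented numeric column

values = [v for v in column if v is not None]
value_distribution = Counter(values)              # Column Value Distribution
null_ratio = column.count(None) / len(column)     # Column Null Ratio
stats = {"min": min(numbers), "max": max(numbers),
         "mean": mean(numbers), "stdev": pstdev(numbers)}  # Column Statistics
patterns = {re.sub(r"[A-Z]", "A", re.sub(r"\d", "9", v))   # Column Pattern
            for v in values}
```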

3.2 Others

There are other important mechanisms that affect the ETL process in SSIS, such as parallel processing, performance optimization and SSIS buffer management.

Parallel processing typically uses the MaxConcurrentExecutables property to control how many tasks can run at the same time, by specifying the maximum number of SSIS threads that can execute in parallel per package. For instance, if the value is changed to 4, then only 4 tasks will run at the same time. [21]
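The effect of such a cap can be sketched with a thread pool whose worker count bounds the concurrency, in the same spirit as MaxConcurrentExecutables (this is an analogue, not the SSIS scheduler). The tasks record the peak number running at once so the cap is observable.

```python
# Concurrency-cap sketch: at most MAX_CONCURRENT_EXECUTABLES tasks at once.
from concurrent.futures import ThreadPoolExecutor
import threading
import time

MAX_CONCURRENT_EXECUTABLES = 4
lock = threading.Lock()
running, peak = [0], [0]

def task(_):
    with lock:
        running[0] += 1
        peak[0] = max(peak[0], running[0])  # record observed concurrency
    time.sleep(0.05)                        # simulate the task doing work
    with lock:
        running[0] -= 1

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_EXECUTABLES) as pool:
    list(pool.map(task, range(10)))
```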

Performance optimization means tweaking the buffers used in the pipeline to achieve maximum efficiency. For instance, when the server has sufficient memory, larger buffers can be used instead of small ones, trading memory for processing speed. But this configuration needs to be planned and tested carefully, otherwise it may become a double-edged sword and cause low performance on the server. [21]

4. Other ETL Tools