This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
When we consider Data warehousing it is a combination of data from different data sources into single comprehensive and easily maintainable database. We can access data warehousing by using queries, analysis and reporting. Data warehousing always creates single database in the end of the process. The end result of the data warehouse is homogeneous data, which can be more easily and quickly handle. Data warehousing is basically used by organizations to analyze trends over time. Its primary target is facilitating strategic planning and forecasting by using long-term data analyzing. By using such analyzing we can easily create business models, forecasts, other reports and projections.
The first stage of an ETL process involves extracting the data from the source systems. Most data warehousing projects contained data from different data sources. Those data sources can contain different data types and formats. Most Commonly used data source formats are relational databases and flat file systems. The key function of extraction phase is to convert the data into a single format, which is suitable for transformation processing.
The transform phase applies a series of rules or functions to the extracted data before loading into the end destination. Some data sources will not require any kind of data manipulation. Sometimes, it requires below mentioned transformation types to meet the business and technical needs of the target database:
Need to select only special columns to load (or not select null columns)
Translating coded values (e.g., when the source system stores 1 for male and 2 for female, but in the data warehouse stores M for male and F for female), this kind of task is called automated data cleansing. In ETL process there is no manual cleansing occurs.
Joining data from multiple sources (e.g., lookup, merge)
Can Derive a new calculated value for columns (e.g., sale_amount = qty * unit_price)
The load phase loads the data into the data warehouse . According to the companies requirement this process may changed widely. Sometimes data warehouses existing information will update daily, weekly, monthly or year basis. In the load phase interacts with a database, the constraints and triggers defined in the database schema is activated at the time when data load (as example, primary key, unique key and referential integrity constraints).Those constraints are help for overall data quality and performance of the ETL process.
Reporting tools are used for designing the 2D and 3D graphs for representation to take management decisions based on the data in the Data warehouse. Such a tools produce outputs can be stored and reviewed on later time as well. Generally reports are produced on timely manner such as daily, weekly or monthly.Â Â
In general Reporting tools can be classified as
Less Ad Hoc
Report View of Data (Header and Detail)
retrieve by using pre-defined user scenario
Display small or medium size of data
Some Examples of such a reporting tools are below:Â
Access - Microsoft
Managed Reporting Environment
SQL Server 2005 Reporting Services - Microsoft
Open Source Tools
Microsoft (SQL Server Integration Services)
SQL Server Integration Services (SSIS) is a component of the Microsoft SQL Server database software. We can use this for do huge range of data migration tasks.
SSIS can use as a platform for data integration and workflow applications. It contains lots of features for fast and flexible data warehousing tool which used for data extraction, transformation, and loading (ETL).
Microsoft Firstly released with Microsoft SQL Server 2005 then 2008.SSIS service is only available in the "Standard" and "Enterprise" editions of SQL server versions
You can use SSIS to transfer millions of rows of data to and from heterogeneous data sources. But that is not the end of SSIS functionality. This tool also acts as Business Intelligent (BI) functionalities such as complete data integration, movement, and shaping. It means that SSIS provides data cleansing, extensibility, and interoperability.
SSIS has lot of pre-built data-cleansing functionality, including completely integrated functions, such as fuzzy matching and fuzzy grouping that use algorithms to match or group disparate data to a configurable degree of accuracy.
SSIS offers grate support for third-party component vendors also. SSIS allows building your own components or adding a third-party component to solve your problem.
SSIS can load data directly into Analysis Services cubes, and it also offers robust data-mining features for including scalable data-mining model creation, training, and predictions.
SSIS is well integrated with SQL Server Reporting Services, so this will allow you to treat an SSIS package as the data source for reporting.
SSIS also provide high performance and scalability that allow you to host complex, high-volume ETL applications on lightweight servers
SSIS can also help reduce ETL data staging areas and help minimize performance costs associated with data staging (disk I/O and serial processing). This will lead to perform complex data transformations, data cleansing, and high-volume lookups from source to destination.
SSIS also provides the Slowly Changing Dimension (SCD) wizard. By using SCD interface, you can quickly generate all the steps and required code to add unique handling of history to multiple attributes in a given dimension.
Business Intelligence Development Studio is used for SSIS development environment and is hosted in Visual studio. So that can use scripting and programming languages to take advantage of enterprise development environment
SSIS now fully supports the Microsoft .NET Framework .so that software engineers can use any .NET compliant language to develop the functionality of SSIS.
The Data Transformation run-time engine of SSIS is used both native COM object model and as an entirely managed object model. But Data Transformation run-time engine is written in native code. Although it signed Primary Interop Assembly (PIA) enables full managed access to it
Technological infrastructure required
1. Operating System:
Windows XP Professional SP2
Windows Server 2003 SP2 Enterprise
Can find broad documentation support and best practices to data warehouses and much more
ease to learn and speed of implementation
standardized data integration methods are used
It provides real-time, message-based notifications
With compared to other commercial products cost is relatively low , excellent after sales support and distribution model
The key problem on SSIS is that runs only on Windows environments. It doesn't give any support for Linux and Unix
Microsoft always hides their future path until Beta version is released. So that unclear vision and strategy is adhere with SSIS
Talend Open Studio
Talend Open Studio is open source data integration product designed to combine, convert and update data in various locations across a business.
Talend has the design tool which can build the Jobs, using the set of components available.
It works with project concept, which is a container of different Jobs with metadata and contexts.
Talend is a code generator, so Jobs are translated into corresponding defined language (Java or Perl can select when create a new project), compiled and executed.
Talend Components are bind to each other with different types of connections. One is to pass information (which can be of Row or Iterate, as how to move the data). Also, you can connect with each other triggering connections (Run If Component is Ok else Component Error) that allow us to clear the sequence of execution and ending time
Talend Jobs can run independently of the design tool on any platform that allows the execution of the selected language.
Talend generated code is visible and modifiable (although you modify the tool to make any changes to the Jobs).
Talend has a large number of components. According to the action we can select the component and access to databases or other systems. There are different components according to the database engine that we will go to use. As a example, we have an input table component for each manufacturer (Oracle, MySQL, Informix )
Talend works with the workspace concept, at filesystem level.Â This is the place you store all the components of a project (all Jobs, metadata definitions, custom code and contexts).
Talend repository is updated with the dependencies of changed objects (expand to all project changes). If we change the table definition in repository, for example, is updated in all the Jobs where it is used.
Talend manages full metadata that includes links to databases and the objects (tables, views, querys).Metadata info is centrally stored in workspace and its not necessary to read again from source or destination system, which make simple and efficient the process. In addition if we want can define metadata file structures (delimited, positional, Excel, xml, etc), which can then be reused in any components later.
Talend allows us to use our custom code using Java and Groovy.
Technological infrastructure required
Windows, Unix and Linux.
Talend has Unified user interface across all components. Based on Eclipse, the knowledge of the tool enables us to use the interface.
Talend has large number of components to connect to multiple systems and data sources, and constantly evolving.
Talend can learn easily by using help shortcut in the application and comprehensive online help components.
In Talend when we designed our own code in Java, we have the context assistance of language provided by Eclipse GUI.
In Talend we can easily develop our own components by using existing code(code reuse). We can include our own libraries, which are visible in all Jobs in a project.
By using Talend we can easily model charts and can conceptually draw our Jobs designs and processes.
Sometimes Talend get excessively slow caused by the use of Java language
Talend Tool is unintuitive and difficult to understand
IBM (Information Server Infosphere platform)
IBM InfoSphere Data Stage integrates data across multiple and large volumes of data sources and target applications. It supports real-time data integration with a high performance parallel framework, extended metadata management, and enterprise connectivity.
IBM InfoSphere DataStage is powerful ETL solution. That supports the collection, integration and transformation of high volumes of data, with data structures varying from simple to highly complex. IBM InfoSphere DataStage also can manages data arriving in real-time as well as data received on a time to time or scheduled basis
IBM InfoSphere DataStage Enterprise Edition provides the parallel processing capabilities of multiprocessor hardware platforms. Because of that it can support to satisfy the demands of growing data volumes, strict real-time requirements
IBM InfoSphere DataStage support for a virtually unlimited number of heterogeneous data sources and targets in a one job includes text files; complex data structures in XML; ERP systems such as SAP and PeopleSoft; almost any database (including partitioned databases); web services; and business intelligence tools similar to SAS.
IBM InfoSphere DataStage supports real-time data integration operation. It can captures messages from Message Oriented Middleware (MOM) queues using JMS to combine data into historical analysis perspectives.
IBM InfoSphere Information Services Director provides a service-oriented architecture (SOA).We can use it to publishing data integration logic as shared services and after that it can be reused across the enterprise. These types of services are capable of simultaneously supporting high-speed, high reliability requirements of transactional processing and the large volume of bulk data requirements of batch processing
IBM InfoSphere DataStage's advanced maintenance and development allows developers to maximize speed, flexibility and effectiveness in building, deploying, updating and managing their data integration environment. Complete data integration reduces the development and maintenance cycle for data integration projects by easy administration and maximizing development resources
IBM InfoSphere DataStage can use to perform information integration directly on the mainframe. Because of that it enables to use existing mainframe resources in order to maximize the value of your IT investments
Technological infrastructure required
Windows, Unix and Linux.
IBM InfoSphere DataStage has strongest vision on the market and flexibility
progress of IBM InfoSphere DataStage is towards common metadata platform
IBM InfoSphere DataStage has high level of satisfaction from clients and a variety of initiatives
IBM InfoSphere DataStage has difficult learning curve
It also has long implementation cycles
IBM InfoSphere DataStage requires high performance computers (i.e. high memory and lot of processing power)
Pentaho Data Integration (KETTLE)
Pentaho Data Integration (PDI) delivers powerful Extraction, Transformation and Loading (ETL) facilities by using an innovative, metadata-driven approach. Pentaho has an intuitive, graphical, drag and drop design environment, and a proven, scalable, standards-based architecture.
Pentaho has the design tool built Spoon transformations (minimum design level) using the steps.At a higher level we have the Jobs that let you run the transformations and other components, and orchestrate process.
Pentaho is not a code generator.It is a transformation engine, where data and its transformations are separated.
In Pentaho the transformations and Jobs are stored in XML format, which specifies the actions to take in data processing.
In Pentaho transformations use steps, which are linked to each other by jumps, which determine the flow of data between different components
for the jobs, we have another set of steps, which can perform different actions (or run transformations). The jumps in this case determine the execution order or conditional execution.
In Pentaho for similar actions (eg reading database tables), a single step (no one from each manufacturer), and behavior according to the database defined by the connection.
In Pentaho dependencies are not updated if you change a transformation that is called from another. If the level of components within a single transformation or job.
In Pentaho the metadata is limited to database connections, which metadata can be shared by different transformations and jobs.
Database information (catalog tables / fields) or files specifications (structure) is stored in steps and can not be reused In Pentaho.This info is read in design time.
Using Variables in Pentaho tool parameters file (file kettle.properties). Passing parameters and arguments to the process (similar to the contexts), both in jobs and transformations.
Technological infrastructure required
Windows, Unix and Linux.
Pentaho is a transformation engine, and notes from the outset has been designed by people who needed to meet their needs in data integration, with great experience in this field. It is also easier to manage the datatypes with Pentaho it is not as strict as Java.
Pentaho is very intuitive tool,Â with some basic concepts can make it works. Conceptually it is very simple and powerful.
Pentaho database repository gives us many opportunities for teamwork. In this repository is stored xml, containing the actions that Transformations and Jobs take on the data.
By using Pentaho tool for design of the interface can be a bit poor and there is no unified interface for all components, being sometimes confusing.
Pentaho is much slower tool evolution and uncertain because Pentaho tends to leave the OpenSource focus
Pentaho has very Limited availability of components, but more than enough for most ETL or data integration process
Pentaho has very poor Help , almost nonexistent in the application.The online help in the Pentaho website is not particularly full, and in some parts is very small, so that the only way to determine the functioning of the component is test it.