This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
This type of data errors is basically mechanical which happens due to the inability of the legacy system to automatically validate certain user inputs. It is often difficult to prevent this type of errors from taking place in the system as it is a part and parcel of every legacy system.
Human Errors are a major source of data manipulation in the legacy system. Again, this can take place because of the inability of the legacy system to validate data entered manually by users. However, some of these errors are rather logical nature. Let's take the example of a date field which refers to the purchase data of any product. It may so happen that the user inputs a date which might be valid from his prospective, but it could wrong from the business aspect, i.e. the user might have input a date on which a business transaction might not have taken place.
It could happen for two reasons. Firstly, the user might input data forcibly to conform to the requirement of the legacy system. Secondly, the user can try to purposely manipulate the data in the legacy system to fulfill his ends.
Target system model definition
If the target system model dictates the in a certain format which can not be found in the legacy system this kind of errors can crop up.
This section reviews the key data quality problems that need to be addressed by data cleansing and data transformation. Database problems exist in either the structure of the database schema (schema-level problems) or in the content of database (instance-level problems). Databases that have schema-level problems may result in having instance-level problems or vice-versa. These are the primary focus of data cleaning. Databases can either be single source or multi-source databases. The following subsection will review schema-level and instance-level problems in the context of single and multi-source databases.
The data quality of a source mainly depends on the measure to which it is govern by schema and integrity constraints controlling permissible data values. For sources without schema, such as files, there are few limitations on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies. Database systems, on the other hand, implement restrictions of a specific data model as well as application-specific integrity constraints. Schema-related data quality problems thus occur because of the lack of appropriate model-specific or application-specific integrity constraints, e.g., due to data model limitations or poor schema design, or because only a few integrity constraints were defined to limit the overhead for integrity control. Instance-specific problems relate to errors and inconsistencies that cannot be prevented at the schema level. [Erhard Rahm*,2000]
DATA CLEANING PROBLEMS
SCHEMA LEVEL PROBLEMS
LACK OF INTIGRITY CONSTRAINTS
POOR SCHEMA DESIGN
HETEROGENEOUS DATA MODEL
INSTANCE LEVEL PROBLEMS
DATA ENRY ERROR
IRREGULAR AND LATE DATA
Schema level problems
Lack of integrity constraint
Data integrity allows to defining certain data quality requirements that the data in the database needs to meet. If a user tries to enter data which does not match with business requirements then also database allow inserting. This types or error will occur due to lack of integrity constraint e.g., when we are creating online account in the form if last name is compulsory to feel than user leave null last name than also that field allow to go ahead it is lack of integrity constraint. There are five integrity constraints. Not null, unique key, primary key, foreign key, check
Poor schema design
At the time of schema development primary key not defined, validity check, integrity constraint not defined, constraint not given, and schema without normalization.
It means provide individually unique number. User can take a primary key or auto increment column into database table. Uniqueness violation error occur when the same number provided to many persons like same student id is provided for two different students e.g.; Student1= (name="Hardik", ID="000423601") Student2= (name="Jatin", ID="000423601")
Referential integrity means in master table there is a primary key and slave table that primary key will be reference key. In this type of situations when records are missing from master table so it is difficult to get correct data into slave table. e.g.; Student= (name="Hardik", deptname="CMS") so in this record department name CMS will be missing from department master table than it will be consider as a dirty data. It id similar as violated attribute dependencies e.g.; in the employee table there will be department id it will be foreign key from department table if user will try to enter department id into employee table which is not in the department table so it will violate the foreign key constraint.
it mean the values which is enter by user it is not in domain range like in marriage date user has entered 31.02.2000 than it is wrong because this type of date is not available in the domain name so it will be treat as a dummy data.
Heterogeneous data model
A data-warehousing term that describes the idea of drawing data from several different data sources on different platforms and computers or it would be different types of data, especially in the same data source. E.g.; this data may have two different types of data: strings and ints.
Since the data is coming from several sources with varying data types and data precisions, it is likely that the source may not match the target. E.g.; hard coded value for CUST_NAME. Since the constant 'CUST1' was used, the data type defaults to CHAR, not VARCHAR2. The column in the target table is VARCHAR2(20), So it will cause a problem. This problem is resolved by the CAST function used below. SELECT CAST('CUST1' AS VARCHAR2(20)) AS CUST_NAME; This expression creates a column called CUST_NAME with the data type VARCHAR2(20), not CHAR.
Naming conflicts arise mostly because of homonyms and synonyms. When the same name or similar sounding words are used for different objects or meaning which are called homonyms such as boar and bore and some of identically spelled words with different meaning such as fair (meaning festival) and fair (meaning Just). Example when 'cname' field is used in customer table as a customer name and also 'cname' field is used in counter or color table. In the database context, a synonym is the opposite of a homonyms and it indicates the different names are used for the same objects. Example car and auto refer to the same object. This type of homonyms and synonyms is confusion at the time of writing query or programming so this problem should be avoid at the time of database design. At the time of database design naming conflicts is one of the most important part example column names and table names must be meaning full like student_id, student_names it should not be like A,B,C.[Erhard Rahm*,2000]
Structural conflicts occur in many variations and refer to different representations of the same object in different sources, e.g., attribute vs. Table representation, different component structure, different data types, different integrity constraints, etc. [Erhard Rahm*,2000]
Some of the cases data conflicts appear like a duplicated records or contradicting records. When there are the same attribute names and data type available. E.g.; information from source-A says that Hardik lives in London; information from source-B says that Hardik lives in Scotland. In this case it is difficult to take decision we can use both if he lives at both places or we can use the most recently updated information or we can user most trusted information or we should not use anything rather than use wrong
Everyone is not using the same format e.g.; dates are especially problematic 12/19/77, 12/19/1977, 12-19-77, 19/12/77, Dec 19, 1977, 19 December 1977
At the time of store price of a skillet in France and the price of a port in Germany into database. We cannot store it all in the same currency (say, US$) because the exchange rate changes. Price in foreign currency stays the same. On this situation according to me, we keep the data in foreign currency and use the current exchange rate to convert
Instance level problems
A main problem for cleaning data from multiple sources is to identify overlapping data, in particular matching records referring to the same real-world entity (e.g., customer). This problem is also referred to as the object identity problem, duplicate elimination or the merge/purge problem. Frequently, the information is only partially redundant and the sources may complement each other by providing additional information about an entity. Thus duplicate information should be merge/purged out and complementing information should be consolidated and merged in order to achieve a consistent view of real world entities.
In the data warehouse, data cleansing is applied particularly when several databases are merged. Records referring to the same entity are represented in different formats in the different data sets or are represented erroneously. Thus, duplicate records will appear in the merged database. The issue is to identify and remove these duplicates. This problem is known as the merge problem.
The multisource problems and initial approaches to solve are discussed in above chapter. This chapter is mainly focus on the implementation to solve multisource data cleansing problems of data warehouse using different components and standard query language.
There is variety of tools available to clean data few of them I have compared in the literature review. And finally I have decided to use SQL Server Integration Services (SSIS) tool which is excellent data transformation platform built on Microsoft SQL Server. I have decided to use this tool because it contains vastly more functionality and there are many components available to clean data and there are few components where own logic can be applied to clean data as per needed which facility may not available in any other cleansing tools. It provide variety of different tasks from basic operation like import or export data to complex operations like ETL task and support heterogeneous and homogenous data source. The powerful functionality to develop this product control flow and data flow area where I have executed self-contained packages to clean the data.
Implementation of cleansing
Planning the solution
Generally the initial step when developing the new product is always starts with initial planning. In this phase I have describe my initial planning of my original data source which is Microsoft Access and Microsoft SQL Server. And destination source which is Microsoft SQL Server between this two processes where I am doing data cleansing. there is basic information is provided in this phase but it is necessary to understanding objectives before staring of development process because clear initial description of the objectives removes doubt and can get successful solutions.
Staring with development I have extracted both source data from Microsoft Access and Microsoft SQL Server and import it into a temporary destination source which is staging area between these two processes I am solving multisource data cleansing problems into staging area and after finished that process data will be loaded into final data warehouse destination which is in Microsoft SQL Server. The process start with a database connection and import data from source and add transformation and cleaning process and load into final destination. This whole development process I have described more in details in this chapter.
Data analysis: this is initial process of cleansing. As soon as I got data source I start to find data cleansing problems from the data source. Purpose behind data analysis I can know which kind of errors available in unclean data and I can get basic idea which types of problems should be removed from the database at the time of data cleansing. There are few approaches for data analysis. A part from them I have used data profiling from data Match 2010 tool for get problem and I have also find out problems manually using SQL queries. I have put two screenshot for data analysis one from data match 2010 and other from SQL query
Creating the solutions
All SSIS solutions start in the SQL Server Business Intelligence Development Studio. This integrated development environment serves as the main hub for developing this product. Here in the dialog box to create new project I have selected integration services project to development this prodeuct.
After creating the new project, there is a user interface environment that is used to define the project workflow. The workflow is divided into three different tasks Control flow, data flow and event handling.
In the control flow area it is decided that what type of tasks will be executed of the package there are many different tasks available which show the flexibility of SSIS. In the control flow area I have defines two different tasks data flow task and execute SQL tasks and these two objects are connected each other using 'add precendence constraints'. I have used data flow task for join heterogeneous and homogenous data source and for data conversions and transformations and Purpose of using SQL tasks is to execute SQL statements from particular data source. I have executed many SQL user defined procedures in this area to clean data. And I have also defined event handlers task which execute if my package run with error event. It is not part of main package. To execute this task I have defined control flow logic.
Define control flow logic
I have take data flow task form control flow items and add the logic to this control flow task to support the data import. This task moves data between source and destination. And take the new component OLE DB Source from data flow sources where I have selected connection with source dataset which is Microsoft Access and Microsoft SQL Server and select table from the source dataset. Using connection managers I have created three connections one is for source data which in Access and another source data which is in SQL server and temporary destination staging area which is in SQL Server new connection with source data.
Define Derived column
In this phase i have select derived column component from data flow transformation for add one column (DatabaseId) to destination source because I have merge two data source in single destination so identify data i have add this column. And I have also specified value for that column in this component where number 1 for Baroda dataset which is originally coming from Access and number 2 for Rajkot dataset which origannaly coming from SQL Server. Derived column is not used to define only new column it can be used to transform particular value. E.g.; I have replace SEX field because in Access SEX store 0/1 and SQL Server store as M/F so using this component I have transform 0/1 into M/F.
Define Data conversion
Using this component I have changed data type as per necessity because I have used heterogonous dataset which is in access and I am importing it into SQL Server so all data types are not matching. So I have convert data types using this components.
This component is lookup values in the reference dataset. I have used lookup to solve uniqueness problem so it will not allow to same id to many things if user will try to enter but this component will not allow inserting. And as an output I have selected redirect row.
OLE DB Destination
This component is select from data flow destinations. I have used this component to load data into destination which is SQL Server staging area. In this part destination connection and table will be selected. I have preferred fast load option because of performance. After select mapping between input column and destination column. Here sometime copy of data conversion.column available because for that particular column I have dine data conversion so I have select that otherwise it will not load data and giving error.
Handling Data Flow Errors
In the some of cases at the time of execution multiple errors comes in the package. Some situations error was occurring before the execution stopped and report failure. When I am executing package there are two paths in the data flow green connectors' path between components is sign of successfully run package. But error path red connectors between all components it show error in component and failed execution.
To solve error in component I have used debugging which shows that what is going on at the time of package execution and I have put troubleshoot and validate processing logic. There is in control flow debugging one option available called breakpoint where I did pause the control flow execution so execution part can be observed easily and execution process viewed so debugging process is useful to me to solve error at the time of execute the package.
Parallelism and Scalability
In all above steps I have finished the process of merge two dataset where I have made new staging database in SQL Server and I have imported both dataset from source to staging area. And during the process of extract from source and load into staging I have done transformation where it was necessity which I have explained in above steps. And after I have made user define procedure for further data cleansing which I will run same way from SSIS packages. At last after execute all procedure successfully I have done testing for staging area which is explained in next chapter. In the further part of this chapter I have explained my user defined procedures where I have made separate procedure for individual table and I have tried to solve all problems related to that table.
User define procedure Customers
User define procedure Employees
User define procedure Colors
User define procedure Manufacturers
User define procedure Promotions
User define procedure Models
The two sources in the example of above are both in relational format but exhibit schema and data conflicts. At the schema level, there are name conflicts (synonyms Customer/Client, Cid/Cno, Sex/Gender) and structural conflicts (different representations for names and addresses). At the instance level, we note that there are different gender representations ("0"/"1" vs. "F"/"M") and presumably a duplicate record (Kristen Smith). The latter observation also reveals that while Cid/Cno is both source-specific identifiers, their contents are not comparable between the sources; different numbers (11/493) may refer to the same person while different persons can have the same number (24). Solving these problems requires both schema integration and data cleaning; the third table shows a possible solution. Note that the schema conflicts should be resolved first to allow data cleaning, in particular detection of duplicates based on a uniform representation of names and addresses, and matching of the Gender/Sex values.