Decision making in data warehouse


1- Introduction

This research explores the improvement of decision making in data warehouses. Technologies such as decision support systems (DSS) help solve several sorts of problems, particularly those that are based on quantitative data and/or are limited in scope. Managers use aggregate data (summary information) retrieved from their organizations' databases and data warehouses to make planned or strategic decisions that run and improve their organizations' operations. Most databases and data warehouses accumulate data errors, introduced either by people or by system failures, and decision quality suffers when data errors exist in the database. Because finding and correcting data errors can be costly, resource-intensive, and frequently impractical, the need for error-free data may be replaced by knowledge gained from an assessment of information quality. A manager may not be able to obtain ideal information from aggregate analysis, but some knowledge about the quality of that information still helps him respond to changing business scenarios and take suitable action. With the support of sound decision making, a manager can raise profit, reduce risk, and estimate the quality of information. In one scenario, a manager retrieves the total count of active customers who have placed orders for a certain product in the past and uses it to plan the required inventory stock and the distribution of products. The correctness and completeness of the customer count directly affect the manager's forecasting and planning decisions, which could lead to over- or under-production and corresponding inventory levels. With complete and correct information, a manager can easily adjust the plan. The term data quality is a subjective notion that depends on the context and goals of the information consumers. Frequently, subjective qualitative measures, such as low, medium, and high, are used to specify the quality of data.
On the other hand, users may not share the same view of what constitutes low- or high-quality data. The above example illustrates that quantitative metrics for evaluating information quality would lead to more objective judgments and decisions. Hence, the main goal of this work is to provide a framework in which the quality characteristics of aggregate data can be measured quantitatively.
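As a minimal illustration of replacing a subjective low/medium/high label with a number, the sketch below bounds the true active-customer count from a reported aggregate and a sampled error rate. All figures and variable names here are hypothetical, invented for illustration; they are not the framework developed in this thesis.

```python
# Hypothetical sketch: quantify the quality of an aggregate count instead of
# labeling it "low", "medium", or "high". Assume an audit of a random sample
# found some records erroneously flagged active; use the sampled error rate
# to estimate the true count. All numbers below are illustrative.

reported_active = 10_000   # COUNT(*) of customers flagged active in the warehouse
sample_size = 500          # records audited by hand
sample_errors = 25         # audited records found to be wrong

error_rate = sample_errors / sample_size            # observed error fraction
estimated_true = reported_active * (1 - error_rate) # corrected aggregate
completeness = 1 - error_rate                       # simple quality score in [0, 1]
```

A manager who sees `completeness = 0.95` rather than "medium quality" can reason quantitatively about how much stock to hold against the possible over-count.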

1.1- Database

A database is a collection of information organized in such a way that a user, as well as a computer program, can rapidly select desired pieces of data. We can think of a database as an electronic filing system. Traditional databases are organized into records containing fields, and each field contains a specific piece of information.

A database management system (DBMS) consists of a database that stores the data and a set of programs that control the database and provide fast access to it. The software provides a variety of services: the ability to define the structure of the database (often called its schema); concurrency, so that several users may access and store data at the same time; distribution, so that the data can be stored at different locations within and beyond a city; and security and reliability, protecting the stored data against unauthorized access and system crashes.

A relational database consists of a set of tables and is the most familiar type of database. Each table is rectangular and can be thought of as corresponding to a single flat file. Database terminology is somewhat different from that used in data mining: tables consist of attributes (also called columns or fields) and tuples (also called records or rows). Most notably, each table is assigned a unique name, and each tuple in a table is identified by a special attribute, called a key, that serves as its unique identifier. Relational databases are also described by the entity-relationship (ER) data model, which defines a set of entities and the relationships between those entities (David et al., 2003).

1.1.1- Types of Database

There are two generic database architectures:

  • Centralized Database
  • Distributed Database

Centralized Database

With a centralized database, all data are located at a single site. Users at remote sites can access the database using data communications facilities. Three familiar examples of centralized databases are a personal computer database, a central computer database, and a client/server database.

Distributed Database

A distributed database is a single logical database whose data are physically spread across computers at multiple sites.

There are two generic categories of distributed databases:

  • Homogeneous Database
  • Heterogeneous Database

1.2- Transactional Database

A transactional database is one managed by a DBMS in which write operations can be rolled back if they do not complete correctly. If a transactional database system loses electrical power, suffers a network breakdown, or fails for any other reason half-way through a transaction, the partially completed transaction is rolled back and the database is restored to the state it was in before the transaction started. Assume we are working with a 2-tier or 3-tier application in which the front end and the back end are entirely separated. Suppose the front-end application sends instructions to the database system: for example, it creates an order for a customer and then removes the ordered product from inventory. If the front-end application unexpectedly crashes part-way through (because of a power failure or any other reason), the transactional database can then roll back the partially completed transaction.
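The crash-and-rollback scenario above can be sketched with Python's standard `sqlite3` module; the table and column names are illustrative, not taken from any real application.

```python
import sqlite3

# In-memory database standing in for the application's back end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("CREATE TABLE orders (product TEXT, amount INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 10)")
conn.commit()

try:
    # One transaction: record the order, then deduct the stock.
    conn.execute("INSERT INTO orders VALUES ('widget', 3)")
    conn.execute("UPDATE inventory SET qty = qty - 3 WHERE product = 'widget'")
    raise RuntimeError("front-end crashed mid-transaction")  # simulated failure
except RuntimeError:
    conn.rollback()  # undo the partially completed transaction

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock = conn.execute("SELECT qty FROM inventory").fetchone()[0]
# After the rollback, no order is recorded and the stock is unchanged.
```

Both statements of the interrupted transaction are undone together, which is exactly the guarantee a transactional database provides to a crashing front end.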

1.3- Data Warehouse

A data warehouse is a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process. Usually, the data warehouse is maintained separately from the organization's operational databases, for several reasons. The data warehouse supports on-line analytical processing (OLAP), whose functional and performance requirements are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by operational databases. OLTP applications typically automate clerical data processing tasks such as order entry and banking transactions, the bread-and-butter day-to-day operations of an organization. These tasks are structured and repetitive, and consist of short, atomic, isolated transactions. The transactions require detailed, up-to-date data, and read or update a small number of records (tens), usually accessed by their primary keys. Operational databases tend to be hundreds of megabytes to gigabytes in size. Consistency and recoverability of the database are critical, and maximizing transaction throughput is the key performance metric. As a result, the database is designed to reflect the operational semantics of known applications and, in particular, to minimize concurrency conflicts. Data warehouses, in contrast, are targeted at decision support. Historical, summarized, and consolidated data are more important than detailed, individual records. Since data warehouses contain consolidated data, perhaps from several operational databases, over potentially long periods of time, they tend to be orders of magnitude larger than operational databases; enterprise data warehouses are expected to be hundreds of gigabytes to terabytes in size.
The workloads are query intensive, with mostly ad hoc, complex queries that can access millions of records and perform many scans, joins, and aggregates. Query throughput and response times are more important than transaction throughput. To facilitate complex analyses and visualization, the data in a warehouse is typically modeled multidimensionally (Fu and Rajasekaran, 2000).

1.3.1- Comparison of OLTP and Data Warehouse

Data warehouses and OLTP systems have very different requirements. Here are some examples of the differences between typical data warehouse and OLTP systems.

  • Workload
  • Data warehouses are intended to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so it should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations; your applications might be specifically tuned or designed to support only these operations.

  • Data Modifications
  • A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not update it directly.

    In an OLTP system, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date and reflects the current state of each business transaction.

  • Schema design
  • Data warehouses frequently use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.

    OLTP systems frequently use fully normalized schemas to optimize update/insert/delete performance and to guarantee data consistency.

  • Typical operation
  • A typical data warehouse query scans thousands or millions of rows. For example, "find the total sales for all customers last year."

    A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."

  • Historical data

Data warehouses typically store many months or years of data, in order to support historical analysis. OLTP systems typically store data from only a few weeks or months, retaining only as much historical data as is needed to meet the requirements of the current transaction.
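The contrast in "typical operation" above can be made concrete with two queries over the same data, sketched here with Python's `sqlite3` module (the `sales` table and its columns are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, year INTEGER, amount REAL)")
# Five customers, one sale per year for two years.
rows = [(c, y, 100.0) for c in range(1, 6) for y in (2022, 2023)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Typical OLTP operation: fetch a handful of records by key,
# e.g. "retrieve the current order for this customer".
current_order = conn.execute(
    "SELECT amount FROM sales WHERE customer_id = ? AND year = ?", (3, 2023)
).fetchone()

# Typical warehouse query: scan and aggregate many rows,
# e.g. "find the total sales for all customers last year".
total_last_year = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE year = ?", (2023,)
).fetchone()[0]
```

The first query touches one row through a predicate on the key columns; the second must scan every row for the year and aggregate, which is why warehouse workloads favor scan and aggregation performance over transaction throughput.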

1.3.2- Schemas in Data warehouse

A schema is a collection of database objects, including tables, views, indexes, and synonyms. There are a variety of ways to arrange schema objects in the schema models designed for data warehousing. One data warehouse schema is the star schema. Other schema models are also commonly used for data warehouses; the most common of these is the third normal form (3NF) schema. Furthermore, some data warehouse schemas are neither star schemas nor 3NF schemas but share characteristics of both; these are referred to as hybrid schema models.

The Oracle 10g database is designed to support all data warehouse schemas. Some features may be specific to one schema model (such as the star transformation feature, described in "Using Star Transformation", which is specific to star schemas). However, the vast majority of Oracle's data warehousing features are equally applicable to star schemas, 3NF schemas, and hybrid schemas. Key data warehousing capabilities such as partitioning (including the rolling window load technique), parallelism, materialized views, and analytic SQL are implemented for all schema models.

Star Schema

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of the schema resembles a star, with points radiating from a central table. The center of the star is a large fact table, and the points of the star are the dimension tables.

A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables, each of which holds information about the entries for a particular attribute of the fact table.

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them.
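A minimal star schema and star query can be sketched as follows, using Python's `sqlite3` module; the fact and dimension table names (`fact_sales`, `dim_product`, `dim_date`) are illustrative, not Oracle's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension tables: small, one row per attribute value.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
conn.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER)")
# Fact table: large, with foreign keys into each dimension plus the measures.
conn.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product,
    date_id INTEGER REFERENCES dim_date,
    amount REAL)""")

conn.execute("INSERT INTO dim_product VALUES (1, 'toys'), (2, 'books')")
conn.execute("INSERT INTO dim_date VALUES (10, 2023), (11, 2024)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 5.0), (1, 11, 7.0), (2, 10, 3.0)])

# Star query: the fact table is joined to each dimension table,
# but the dimension tables are not joined to each other.
result = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
```

Note how each join runs from the fact table's foreign key to a dimension's primary key, the shape the cost-based optimizer recognizes as a star query.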

A star join is a primary key to foreign key join of the dimension tables to a fact table. The main advantages of star schemas are that they:

  • Provide a direct and intuitive mapping between the business entities being analyzed by users and the schema design.
  • Provide highly optimized performance for typical star queries.
  • Are broadly supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables.

Snowflake Schema

The snowflake schema is a more complex data warehouse model than the star schema, of which it is a variant; it is called a snowflake schema because its diagram resembles a snowflake. A snowflake schema normalizes the dimensions to eliminate redundancy: the dimension data are grouped into multiple tables instead of one large table per dimension. This increases the number of dimension tables and requires more foreign key joins; the result is more complex queries and reduced query performance.

1.3.3- Oracle 10g makes data warehousing easy to implement

The benefits of data warehousing, deployed in the service of high-performance systems, are obvious. And enabling the kinds of data-intensive applications that are typical of data warehousing for use over the Web can be a major strategic advantage. Bringing Oracle and data warehousing together is almost always a good idea, and it isn't the strategic challenge that it might appear to be.

1.3.4- Oracle 10g is well suited to Data Warehousing

The reasons for this ideal match are three-fold. First, the architectural improvements to the Oracle Database that were introduced with 10g are ideally suited to data warehouse structures, storage requirements, and I/O patterns. Second, the suite of products available for data warehouse implementation and management builds on Oracle skills that you, as an Oracle user, are already likely to have. Third, Oracle's Enterprise features specifically enable efficient data mining and other business intelligence functions that you'll want to use once you have a data warehouse in place.

What's the difference between an Oracle database and a conventional relational database? And why is it better for data warehouse implementation? It's important to have a solid grasp of this as you design your data warehouse environment.

  • Partitioning must be much more refined in data warehouse database design than in conventional RDBMS design in order to maintain efficiency. Oracle 10g employs a partitioning technique, partitioning by selected values, that particularly facilitates administrative management of data storage.
  • Conventional online transaction processing (OLTP) systems do many reads of small amounts of data; data warehouses do rare reads of large amounts of data. Oracle 10g supports table joining via bitmapped indexes to build complex arrays that can be called up with minimal I/O.
  • In a conventional RDBMS, transformation of data during loading is usually done internally and is frequently a custom-programming proposition; in a data warehouse environment, there is no place for this. Oracle 10g allows you to define transformations externally, greatly facilitating the exploitation of inbound data for integration.

1.3.5- Tools

Here are the Oracle tools that make warehouse implementation possible. Note that not all of them are warehouse-specific; this speaks to the integrated architectural advantages of an Oracle database and Oracle's concept of warehouse management.

1.3.6- Warehouse Builder

This is one of the biggest, most powerful wizards you've ever used. It's hard to imagine that something as large and complex as a data warehouse might be built by a wizard, but since Oracle is building on top of structures that are already available and well understood, it's almost insultingly painless.

Warehouse Builder allows you to define your data sources; implement data flow between sources and destinations (Extract-Transform-Load, or ETL); design and deploy the appropriate schema, defining all of your tables with simple dimensioning and importable definitions; and design and generate your query environment, including OLAP. Do you lose anything by following this "canned" approach? The GUI gives you every option you could want, and using Warehouse Builder results in a metadata repository that you will find very convenient later on (e.g., storing your whole warehouse design in one place where it can be easily referenced).

1.3.7- Discoverer

You can set up and check reporting with this combined administration/viewing tool. The power of this tool is that it works with browsers and is integrated with Oracle Portal. With that, both internal and external users have an immediately accessible Web reporting mechanism that taps the warehouse directly.

1.3.8- Enterprise Manager

Oracle's data warehouse database architecture is so well integrated that Enterprise Manager, which can be used on any Oracle database, is an ideal management tool. You can handle data transportation, backup and recovery, resource management, system monitoring, and every other administrative job from the OEM Console.

1.3.9- Application Server

Oracle's powerful Internet application suite rounds out your new data warehouse. Internet and internal security, caching, portal setup/management, site usage intelligence, messaging, and J2EE are all handled through Oracle 10g Application Server (Discoverer and Enterprise Manager are actually part of it). Again, because of the deep integration of these services with Oracle database technology in general, with which your new warehouse is fully compatible, you have all of these facilities available right out of the box. Oracle Application Server facilitates the transformation of inbound data via a staging process, which is especially useful in the data warehouse ETL step. Migration of data from tables in your conventional OLTP database into your data warehouse is particularly convenient and can be accomplished with Java stored procedures or SQL (or PL/SQL).

Oracle also has automatic memory management, which is particularly useful for data warehouse workloads. Because of the massive amounts of data typically used in data warehouse analytics and the constant fluctuations in table size, keeping the system tuned would be a full-time job for your DBA if it weren't handled automatically; so Oracle built it in.

1.3.10- Getting Friendly

Oracle's Internet approach is second to none. The suite of service-based products (in Oracle Application Server) is impressive and is essentially designed to make possible an "in-house" Internet architecture (via Oracle HTTP Server) that can be opened up via the portal to the outside world. That is, you can deploy with comparative simplicity a company-wide intranet for data-gathering, analysis, and reporting purposes; then you can define services that are helpful to customers and external users and make those services accessible on a public Web site. Oracle is nothing if not Internet-friendly. And all of this Internet power is fully compatible with the data warehouse you'll deploy next to it. Finally, we're left with a vision of Oracle 10g and Oracle data warehousing, in partnership, extending Oracle's general technological goal of Web friendliness into the sometimes intimidating domain of data warehousing, where data flows like the Mississippi. If this is where you live, you no longer need to fear data warehousing. Oracle saw you headed in that direction to begin with.

1.4- Decision Support System

Decision support systems (DSS) are useful in solving several sorts of problems, particularly those that are based on quantitative data and/or limited in scope. For strategic decisions, however, decision makers can benefit greatly from a tool that tracks and systematizes qualitative and other unformulated information. Such a tool would help cultivate and leverage an organization's intellectual assets so that users' decision making is better informed. While DSS technologies have not usually been used in such situations, they can be adapted to do so.

When the production rate is limited, most contributions simply consider a single buyer. For illustration, Banerjee analyzed the integrated vendor-buyer model in which the items are produced at a limited rate by the vendor. He examined a lot-for-lot model in which the vendor manufactures each buyer shipment as a separate batch (Banerjee, 1983).

The shipments might be made before completion of the entire lot. He also examined an alternative plan in which the quantity delivered by the vendor is not identical at every replenishment: instead, at each delivery all the available inventory is supplied to the buyer. This plan was based on an earlier argument for solving a single-vendor single-buyer system with unlimited production rate at the vendor. The new scheme uses successive shipment sizes within a lot that increase by a factor equal to the ratio between the buyer's demand rate and the vendor's production rate (Goyal, 1995).

Hill showed that neither the equal-shipment-size plan nor the increasing-shipment-size plan is always optimal. He took Goyal's scheme a stage further by considering successive shipment sizes that increase by a common fixed factor. The factor ranges from 1 to an upper bound that corresponds to the ratio between the demand rate and the production rate. Therefore, both the equal-shipment-size policy and Goyal's policy represent special cases of this new policy: when the factor equals 1, the equal-shipment-size policy is recovered (Hill, 1997).
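The family of shipment policies described above can be sketched numerically: successive shipment sizes grow by a fixed factor, with factor 1 recovering the equal-shipment policy. The function name and the numbers below are illustrative, not taken from Hill's or Goyal's papers.

```python
# Illustrative sketch of a geometric shipment-size policy: each shipment is
# `factor` times the previous one, so factor = 1.0 reproduces the
# equal-shipment-size policy. All quantities are made up for illustration.

def shipment_sizes(first_shipment, factor, n_shipments):
    """Sizes of n successive shipments, each `factor` times the previous."""
    return [first_shipment * factor ** i for i in range(n_shipments)]

equal = shipment_sizes(100.0, 1.0, 4)    # special case: equal shipments
growing = shipment_sizes(100.0, 1.5, 4)  # geometrically increasing shipments
lot_size = sum(growing)                  # total production lot for the policy
```

Evaluating the vendor-buyer cost over the admissible range of the factor is then a one-dimensional search, which is what makes the unified policy convenient.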

In the business field, the combination of decision-support systems and data warehouses has led organizations to build up gigantic collections of data. Concise summaries of the data are retrieved through aggregate queries, which cover many data items and return small results. OLAP queries, used extensively in data warehousing, are based almost entirely on aggregation (Gupta and Mumick, 1999).

Aggregate queries have been studied in a variety of settings. Recently there has been much interest in querying and analyzing stream data. Such analysis often requires aggregate queries; maintaining running statistics on the data, for instance, requires them, which raises the problem of processing aggregate queries over streams. For networks of sensors, which produce streams of measurements, aggregate queries were studied as a data-reduction tool. Data reduction is vital in sensor networks, where the cost of communication is often high (Madden et al., 2002).
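The data-reduction idea behind aggregates over streams can be sketched in a few lines: instead of storing every measurement, keep running statistics in constant space. The class below is a minimal illustration, not a reconstruction of the sensor-network systems cited above.

```python
# Minimal sketch of an aggregate over a stream: maintain count, sum, and
# mean in constant space rather than storing every measurement, the
# data-reduction idea used in sensor networks.

class RunningMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one new measurement into the running statistics."""
        self.count += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.count

agg = RunningMean()
for reading in [21.0, 23.0, 22.0, 26.0]:  # simulated sensor stream
    agg.update(reading)
```

A sensor node that ships only `(count, total)` upstream, instead of the raw readings, communicates two numbers per reporting interval regardless of how many measurements it took, which is exactly why aggregation is attractive when communication is expensive.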

A better understanding of these problems requires a complete characterization of equivalences among aggregate queries. The earliest characterizations for deciding equivalence applied to a large and significant class of aggregate queries (Nutt et al., 1998). These characterizations were later extended to aggregate queries with disjunctions and to queries with negation (Sagiv et al., 2005).

DSS research differentiates between model-based DSS and data-based DSS. Data-based DSS usually use an inductive approach, such as data mining, to derive decision rules from raw data. They are less suitable for unstructured decisions (Hand et al., 2001).

Model-based DSS deductively build on decision models leading to decision rules. The main classes of decision models incorporated in model-based DSS are statistical models, simulation models, heuristic models, and optimization models (Pearl, 2000).

Multi-attribute decision analysis is a field of research that has been developed to support preference decisions (such as assessment, selection, prioritization, and so on) over available alternatives that are characterized by multiple, typically conflicting, criteria. One recent development in multi-attribute decision analysis is the use of an evidential reasoning approach based on the Dempster-Shafer theory. The Dempster-Shafer theory models risk by using the concept of the plausibility of an unfavorable outcome, and by capturing both exact data and different types of uncertainty (Sun and Srivastava, 2006).
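As a small worked illustration of the Dempster-Shafer machinery mentioned above, the sketch below combines two mass functions over the frame {'neg', 'pos'} with Dempster's rule of combination and then computes the plausibility of an outcome. The frame, the mass values, and the scenario are invented for illustration.

```python
# Hedged sketch of Dempster's rule of combination from Dempster-Shafer
# theory. Mass functions are dicts mapping frozensets (focal elements)
# to masses; the masses below are illustrative.

def combine(m1, m2):
    """Dempster's rule: combine two mass functions, renormalizing conflict."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:  # non-empty intersection keeps its product mass
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:      # empty intersection is conflicting evidence
                conflict += ma * mb
    return {s: m / (1.0 - conflict) for s, m in combined.items()}

def plausibility(m, hypothesis):
    """Total mass of all focal elements consistent with the hypothesis."""
    return sum(v for s, v in m.items() if s & hypothesis)

frame = frozenset({"neg", "pos"})
m1 = {frozenset({"neg"}): 0.6, frame: 0.4}  # first body of evidence
m2 = {frozenset({"neg"}): 0.7, frame: 0.3}  # second body of evidence
m = combine(m1, m2)
pl_pos = plausibility(m, frozenset({"pos"}))  # plausibility of 'pos'
```

Mass left on the whole frame represents ignorance rather than belief in either outcome, which is how the theory captures uncertainty alongside exact data.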

Where the degree of uncertainty is high and the expenses are irreversible, there is a consensus that real options theory can be applied to capture the financial value of managerial flexibility in IT infrastructure projects. Given a manager's ability to wait and see in the event of uncertainty, real option analysis can help identify the most favorable staging of an investment, and can also make room for further learning about future payoffs before a final decision. A rising volume of research has been performed on IT investments and real options, and on the use of multi-attribute decision analysis to account for uncertainties (Salling et al., 2007).

Traditional databases store unordered sets of information, and queries return unordered sets of values or tuples. In many settings, however, it is important to obtain an ordering or ranking of the members of the answer set. The most widespread applications include search engines, where the candidates qualifying for a given query are ordered by some priority criterion; ranking-aware query processing in relational databases; and network monitoring, where the top-ranking sources of data packets need to be identified to detect denial-of-service attacks. The ranking of query answers is not only relevant to such applications but is also critical for Online Analytical Processing (OLAP) applications. Decision making improves when the aggregate ranking is more accurate (Tao Y., 2004).
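The network-monitoring example above amounts to a top-k aggregate query: group the stream by source, sum a measure, and return the highest-ranked groups. The sketch below illustrates this with invented packet data.

```python
# Sketch of ranking aggregate query answers: group by source, aggregate
# the bytes sent, and return the top-k sources, as in detecting the
# heaviest senders in network monitoring. The data is illustrative.
from collections import defaultdict
import heapq

packets = [("srcA", 10), ("srcB", 500), ("srcA", 20),
           ("srcC", 5), ("srcB", 450)]          # (source, bytes) stream

totals = defaultdict(int)
for source, size in packets:                    # aggregate per source
    totals[source] += size

# Rank the aggregated groups and keep only the top 2.
top2 = heapq.nlargest(2, totals.items(), key=lambda kv: kv[1])
```

Only the ranking of the aggregated totals is returned, not the unordered answer set, which is the distinction the paragraph above draws between traditional and ranking-aware query processing.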

Netsourcing is broadly understood as accessing centrally managed business applications, made available to multiple users from a shared facility over the Internet, for rent or pay-per-use. As with more common forms of outsourcing, this presents a growing sourcing opportunity for companies. Given this opportunity, netsourcing has increasingly gained importance in the decision range of IT managers. Management is ever more confronted with unstructured and complex decisions, for instance whether and how to netsource (McIvor and Humphreys, 2000).

An attractive decision problem arises whenever a product must be supplied by a vendor to multiple buyers. This inventory/distribution system is frequently referred to as the single-vendor multi-buyer problem, and it has been analyzed widely in the literature under the assumption of an unlimited production rate. However, when the production rate is limited, most contributions consider only a single buyer. For instance, Banerjee analyzed the integrated vendor-buyer model where the vendor produces the goods at a limited rate, examining a lot-for-lot model in which the vendor manufactures each buyer shipment as a separate lot. Banerjee's model was later generalized by relaxing the lot-for-lot assumption for the vendor: manufacturing a lot made up of an integral number of equal shipments generally produces a lower-cost solution. Goyal's model was derived on the assumption that the vendor can supply the buyer only after finishing the entire lot (Goyal and Nebebe, 2000).

Work on incomplete data in multidimensional databases focuses on coarse data, where facts are recorded at different granularities; for this sort of data, an analysis of atomic values is considered satisfactory. Other work deals with modeling and querying probabilistic multidimensional data, developing a probabilistic model and query method that can handle distributions but does not consider pre-aggregation (Burdick, 2005).

Data mining extracts knowledge from vast volumes and varied sources of data; this knowledge is modeled and represented as patterns, which we can regard as complex objects. Our approach is devoted to pairing data mining and OLAP in order to create new on-line analysis techniques for complex data. Finally, note that the previously presented works confirm that coupling data mining with OLAP is a promising way to enable rich analysis tasks; they conclude that data mining can extend the analysis power of OLAP tools (Dalvi et al., 2004).

1.5- Significance of Computer

We are living in the computer era. The computer is being used in every sphere of life, and its uses are increasing day by day. It appears that in a few years, life will be impractical without computers; there is not a single field that is beyond the reach of the computer. The computer helps to create files quickly, easily, and accurately, reducing labor and stationery expenditures, storing records, and giving fast access to an account holder's data. It is because of this invention that the world has shrunk to a global community. Wars are won and lost from the computer. It is also used in manufacturing: welding, assembling, and painting cars and buses. This radical development has trimmed down the mental and physical burden of mankind. The computer has a very important role in the medical sciences. Computers are being used to assist doctors in diagnosing diseases: by placing the medical history of a patient in a data bank, one or more doctors can retrieve and update it when needed. Better information about a patient's medical background enables doctors to explore potential health problems and to detect illnesses. Monitoring the physiological status of patients and controlling laboratory tests at the hospital are important applications of computers.

1.6- Computer Based System

A system that utilizes electronic means, especially a computer, is a computer-based system. The computer is one of the most powerful tools in present-day society and has a strong effect on human lives. That is why the use of computerization is increasing day by day.

1.7- Problem Statement

A number of techniques have been developed for enhancing decision making, but some drawbacks exist that directly affect decision quality. The first technique involves a single vendor and a single buyer in a supply chain. The vendor's compliance and yield rates affect the vendor-buyer decisions about the production lot size and the number of shipments delivered from the vendor to the buyer; it follows, then, that these decisions must be determined jointly in order to control the supply chain's total cost. Here the vendor delivers goods to the buyer in a number of unequal-sized shipments. Furthermore, every outgoing item is inspected, and each item failing to meet a lower specification limit is reprocessed. The goal is to reduce the cost of goods and maximize profit, but under this technique the business is not expanded: one order at a time is received and shipped. In the other technique, one vendor supplies an item to multiple buyers. The vendor produces the item at a limited rate, and customer demand occurs at each buyer at a constant rate. The goal is to decide the order quantities at the buyers and the production and shipment schedule at the vendor in order to minimize the average total cost per unit time. But from time to time data errors occur, introduced either by people or by system failures. This research concerns handling data errors to improve organizations' operations. Hence, the main goal of this work is to provide a framework in which the quality characteristics of aggregate data can be measured quantitatively.

1.8- Developed System

This thesis presents the improvement of decision making in data warehouses. Our approach uses aggregate data (summary information) retrieved from organizations' databases and data warehouses to make planned or strategic decisions that run and improve the organizations' operations. We consider databases that contain data errors arising either from people or from system failures; the existence of data errors directly affects decision quality. Because finding and correcting data errors can be costly, resource-intensive, and often impractical, the need for error-free data may be replaced by knowledge gained from an assessment of information quality. Hence, the main goal of this work is to provide a framework in which the quality characteristics of aggregate data can be measured quantitatively.