This paper presents the concepts of data warehousing and data mining, together with data mining tools and techniques. Implementing a data warehouse and data mining benefits organizations and individuals by helping them understand business trends and make better forecasting decisions. These benefits reach users, the organization, and its people in the form of faster work processes, fewer tasks needed to complete work, and better products brought to market in a timelier manner. Despite these benefits, several issues are associated with data warehouses, such as the logical transformation of the data (including data warehouse modeling and de-normalization) and the physical transformation of the data. Many different benefits and issues related to data warehousing and data mining arise in organizational practice. They differ according to the nature of the organization and how it implements data warehouse and data mining concepts.
In this era of globalization, we deal with information, information technology, information systems, and an economy in which knowledge and information are defined as power. Over the past years, data warehouses and data mining have been used in organizations. The history of the data warehouse began in the early 1990s, when organizations started to gain competitive advantage by building data warehouse systems.
A data warehouse is a "subject-oriented, integrated, non-volatile, time variant collection of data in support of management decisions" [Bill Inmon].
Subject oriented means that all relevant data about a subject is gathered and stored as a single set in a useful format;
Integrated refers to data being stored in a globally accepted method with consistent naming conventions, measurements, encoding structures, and physical attributes. It presents a unified view of all data elements with a common definition;
Non-volatile means the data warehouse is read-only. Data is loaded into the data warehouse and accessed there;
Time variant means that data accumulates as time goes on, so the warehouse keeps a history in which time is the most important element.
Data warehousing is a concept. It is a set of hardware and software components that can be used to better analyze the massive amounts of data that companies are accumulating, in order to make better business decisions. A data warehouse is not just the data it holds, but also the structural design and the tools to collect, query, analyze, and present information. Data mining can be defined as the process of analyzing the data in a data warehouse in depth and discovering knowledge that you never suspected existed in your data.
Operational vs informational data
Operational data is the data you use to run your business. This data is what is typically stored, retrieved, and updated by your Online Transaction Processing (OLTP) system. An OLTP system may be, for example, a reservation system, an accounting application, or an order entry application.
Informational data is created from the wealth of operational data that exists in your business, together with some external data useful for analyzing your business. Informational data is what makes up a data warehouse. Informational data is typically:
Summarized operational data
De-normalized and replicated data
Infrequently updated from the operational systems
Optimized for decision support applications
Possibly "read only" (no updates allowed)
Stored on separate systems to reduce impact on operational systems
Characteristic | Operational data | Decision support data
Data content | | Archival, summarized, calculated data
Data organization | Application by application | Subject areas across enterprise
Nature of data | | Static until refreshed
Data structure and format | Complex; suitable for operational computation | Simple; suitable for business analysis
Access probability | | Moderate to low
Data update | Updated on a field-by-field basis | Accessed and manipulated; no direct update
Usage | Highly structured repetitive processing | Highly unstructured analytical processing
Response time | Sub-second to 2-3 seconds | Seconds to minutes
Figure 1.1 Operational data vs decision support data
Business Intelligence at the Data Warehouse
The data warehouse essentially holds the business intelligence for the enterprise to enable strategic decision making. For providing this kind of strategic information, the data warehouse is regarded here as the only viable solution. Organizations use business intelligence (BI) to better understand their own business and to connect that understanding to their strategic decision making. Business intelligence is a broad term that refers to a variety of software applications used to analyze an organization's raw data. Beyond the data warehouse, business intelligence as a discipline comprises several related activities, including data mining, online analytical processing, querying, and reporting.
At a high level, the data warehouse contains critical measurements of the business processes, stored along business dimensions. Where does the data warehouse get its data? The data is derived from the operational systems that support the basic business processes of the organization. Between the operational systems and the data warehouse there is a data staging area. In this staging area, the operational data is cleansed and transformed into a form suitable for placement in the data warehouse for easy retrieval.
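The cleansing and transformation performed in the staging area can be illustrated with a minimal Python sketch. The order rows, field names, and the helper functions cleanse, transform, and stage are all hypothetical and chosen only for illustration; a real staging area would apply far more rules than the two shown here.

```python
from collections import defaultdict

# Hypothetical raw rows as they might arrive from an order entry (OLTP) system.
RAW_ORDERS = [
    {"order_id": "1001", "region": " north ", "amount": "250.00", "date": "2010-01-05"},
    {"order_id": "1002", "region": "North", "amount": "100.50", "date": "2010-01-17"},
    {"order_id": "1003", "region": "SOUTH", "amount": "n/a", "date": "2010-01-20"},
    {"order_id": "1004", "region": "south", "amount": "75.25", "date": "2010-02-02"},
]


def cleanse(row):
    """Return a normalized copy of the row, or None if it cannot be repaired."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None  # reject rows whose amount is not numeric
    return {
        "order_id": int(row["order_id"]),
        "region": row["region"].strip().title(),  # consistent naming convention
        "amount": amount,
        "month": row["date"][:7],  # keep year and month only
    }


def transform(clean_rows):
    """Summarize cleansed rows into total sales per (region, month)."""
    totals = defaultdict(float)
    for row in clean_rows:
        totals[(row["region"], row["month"])] += row["amount"]
    return dict(totals)


def stage(raw_rows):
    """Cleanse every raw row, drop the rejected ones, then summarize."""
    clean_rows = [c for c in map(cleanse, raw_rows) if c is not None]
    return transform(clean_rows)


warehouse_rows = stage(RAW_ORDERS)
```

In this sketch the row with the non-numeric amount is rejected, and the remaining rows are summarized by region and month, which mirrors the summarized, infrequently updated character of informational data.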
Figure 1.2 The position of data warehouse system in Business Intelligence
Data warehouse vs Data mart
A data mart is simply a mini data warehouse. It is a scaled-down deployment of a data warehouse that contains data focused on a departmental user's analytical requirements. For example, Bank X sets up a data mart for its general ledger system to get the ledger system's functional information to the bank's financial analysts and budget coordinators quickly. The data can be extracted either directly from the data warehouse or from the individual repositories from which the data warehouse itself extracts its data. Generally, a data warehouse gathers a wide range of data types, while a data mart holds only the data its users will want.
Building a data warehouse requires money, time, and managerial effort. For this reason, many organizations begin their move into data warehousing by focusing on a smaller and more manageable data set. A data mart is a subset of a data warehouse for a single department or function. In other words, a data mart is a division of an organizational data store, normally oriented to a specific purpose or major data subject, that may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization.
Figure 1.3 How data marts help data warehouse systems feed decision support information systems
Online Analytical Processing (OLAP)
An OLAP application is intended to give end users the ability to perform any business logic and statistical analysis that is relevant to them. This analysis must happen fast: responses should reach users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. Typical applications of OLAP include business reporting for sales, business process management, management reporting, marketing, budgeting and forecasting, financial reporting, and similar areas. The term OLAP was coined as a slight modification of the traditional database term OLTP (Online Transaction Processing). Databases configured for OLAP use a multidimensional data model, which allows complex analytical and ad hoc queries with a rapid execution time.
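To make the multidimensional data model more concrete, the following Python sketch stores a tiny sales cube as a dictionary keyed by dimension members and shows two common OLAP-style operations, roll-up and slice. The cube contents, the dimension names, and the functions roll_up and slice_cube are assumptions made for illustration; production OLAP servers use specialized storage and precomputed aggregates rather than plain dictionaries.

```python
from collections import defaultdict

# Hypothetical cells of a small sales cube: (product, region, quarter) -> units sold.
DIMENSIONS = ("product", "region", "quarter")
CUBE = {
    ("laptop", "east", "Q1"): 10,
    ("laptop", "west", "Q1"): 7,
    ("laptop", "east", "Q2"): 12,
    ("phone", "east", "Q1"): 20,
    ("phone", "west", "Q2"): 15,
}


def roll_up(cube, keep):
    """Aggregate the cube so that only the dimensions named in `keep` remain."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for key, value in cube.items():
        result[tuple(key[p] for p in positions)] += value
    return dict(result)


def slice_cube(cube, dimension, member):
    """Keep only the cells where `dimension` equals `member`."""
    position = DIMENSIONS.index(dimension)
    return {key: value for key, value in cube.items() if key[position] == member}


# Total units per product, across all regions and quarters.
units_by_product = roll_up(CUBE, ["product"])

# Units per region, restricted to the first quarter.
q1_by_region = roll_up(slice_cube(CUBE, "quarter", "Q1"), ["region"])
```

Because every cell is addressed by its dimension members, an ad hoc question such as "units per region in Q1" becomes a slice followed by a roll-up, with no change to the stored data.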
Figure 1.4 OLAP Architecture
Figure 1.5 Multidimensional Concept in OLAP
Multidimensional Concept in OLAP
Multidimensional databases are non-relational Database Management System (DBMS) products specialized for the kinds of queries made in a data warehouse. This is in contrast to using specialized analysis tools that run on top of a traditional Relational Database Management System (RDBMS). Multidimensional data structures can be implemented with dimensional databases or extended RDBMSs. Relational databases can support this structure through specific database designs (schemas), such as the "star schema", intended for multidimensional analysis, and through highly indexed or summarized designs. These structures are sometimes referred to as relational OLAP (ROLAP) structures.
Figure 1.6 Simple STAR schema
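A star schema of the kind described above can be sketched with Python's built-in sqlite3 module. The table names (dim_product, dim_time, fact_sales), their columns, and the sample rows are hypothetical; the point is only the shape, a central fact table of measurements joined to surrounding dimension tables.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- The fact table holds the measurements and one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Phone', 'Mobile');
INSERT INTO dim_time VALUES (1, 2010, 'Q1'), (2, 2010, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 10, 9000.0), (1, 2, 12, 10800.0),
                              (2, 1, 20, 6000.0), (2, 2, 15, 4500.0);
""")

# A typical analytical query: join the fact table to its dimensions and aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, t.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY p.category, t.quarter
    ORDER BY p.category, t.quarter
""").fetchall()
```

Each analytical question becomes a join from the fact table outward to the dimensions it needs, followed by a GROUP BY on the dimension attributes of interest.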
Concept of data mining
In today's world, an organization generates more information in a week than most people can read in a lifetime. It is humanly impossible to study, decipher, and interpret all the data to find useful information. This situation has contributed to the development of data mining in business.
Figure 1.7 Decision support progresses to data mining
Data mining tools
In recent times, many data mining tools suitable for a wide range of applications have appeared in the market. We are seeing the maturity of the tools and products. Data mining needs substantial computing power. Parallel hardware, databases, and other powerful components are becoming very affordable.
Major data mining techniques
There are many different ways of classifying the techniques. Someone new to data mining may be totally confused by the names and descriptions of the techniques. Even among practicing data mining consultants, no uniform terminology seems to exist. Many data mining practitioners seem to agree on a set of data mining functions that can be used in specific application areas. Various data mining techniques are applicable to each type of function. Here are three major types of data mining techniques:
Cluster Detection
This technique is designated as undirected knowledge discovery or unsupervised learning. The algorithm searches for groups or clusters of data elements that are similar to one another. You expect similar products to behave in the same way. When the mining algorithm produces a cluster, you must understand exactly what that cluster means. Only then will you be able to do something useful with it. This technique preserves the concept of one database. This option is good for incremental growth.
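The search for groups of similar data elements can be illustrated with k-means, one common clustering algorithm. The source does not name a specific algorithm, so k-means is a choice made here for illustration, and the product data (unit price, weekly units sold) and starting centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place

        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical products described by (unit price, weekly units sold).
PRODUCTS = [(1.0, 9.0), (1.5, 8.0), (2.0, 10.0), (8.0, 2.0), (9.0, 1.0), (10.0, 3.0)]

centroids, clusters = kmeans(PRODUCTS, [PRODUCTS[0], PRODUCTS[3]])
```

The algorithm returns only the groups. Deciding that one cluster means "cheap, fast-moving products" and the other "expensive, slow-moving products" is the interpretation step that remains with the analyst.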
Decision Trees
This technique applies to classification and prediction. The major attraction of decision trees is their simplicity. By following the tree, you can decode the rules and understand why a record is classified in a certain way. Decision trees represent rules. You can use these rules to retrieve records falling into a certain category.
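The way a tree encodes rules can be seen in a small Python sketch. The tree below is hand-built for a hypothetical loan decision, and its attributes, thresholds, and labels are invented for illustration; a real data mining tool would learn the tree from training data rather than have it written by hand.

```python
# A hypothetical, hand-built tree; a real tool would induce it from data.
TREE = {
    "attribute": "income",
    "threshold": 30000,
    "low": {"label": "reject"},
    "high": {
        "attribute": "years_employed",
        "threshold": 2,
        "low": {"label": "review"},
        "high": {"label": "approve"},
    },
}


def classify(record, node=TREE, path=()):
    """Follow the tree from the root; return (label, list of conditions used)."""
    if "label" in node:
        return node["label"], list(path)
    attribute, threshold = node["attribute"], node["threshold"]
    if record[attribute] < threshold:
        return classify(record, node["low"], path + (f"{attribute} < {threshold}",))
    return classify(record, node["high"], path + (f"{attribute} >= {threshold}",))


def select(records, wanted):
    """Use the tree's rules to retrieve the records falling into one category."""
    return [r for r in records if classify(r)[0] == wanted]


APPLICANTS = [
    {"income": 45000, "years_employed": 5},
    {"income": 20000, "years_employed": 10},
    {"income": 60000, "years_employed": 1},
]

label, rule = classify(APPLICANTS[0])
approved = select(APPLICANTS, "approve")
```

The returned list of conditions is the decoded rule: it states which tests the record passed on its way to the leaf, and therefore why it was classified as it was.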
Memory Based Reasoning
This technique uses known instances of a model to predict unknown instances. It maintains a dataset of known records. The algorithm knows the characteristics of the records in this training dataset. When a new record arrives for evaluation, the algorithm finds neighbors similar to the new record, and then uses the characteristics of the neighbors for prediction and classification. To solve a data mining problem using this technique, you are concerned with three critical issues:
selecting the most suitable historical records to form the training or base dataset
establishing the best way to compose the historical record
determining the two essential functions, namely, the distance function and the combination function
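The roles of the distance function and the combination function can be sketched as a small nearest-neighbor classifier in Python. The training records, the choice of Euclidean distance, the majority vote, and the value k = 3 are all assumptions made for illustration; the source does not prescribe any of them.

```python
import math
from collections import Counter

# Hypothetical training dataset: (age, income in thousands) -> known response.
TRAINING = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((28, 32), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
    ((48, 85), "yes"),
]


def distance(a, b):
    """Distance function: Euclidean distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combine(neighbors):
    """Combination function: majority vote over the neighbors' labels."""
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def predict(record, training=TRAINING, k=3):
    """Find the k training records nearest to `record` and combine their labels."""
    nearest = sorted(training, key=lambda item: distance(item[0], record))[:k]
    return combine(nearest)


older_high_income = predict((47, 82))
younger_low_income = predict((27, 31))
```

In practice the fields would usually be scaled before distances are computed, so that a field with a large numeric range, such as income, does not dominate the others.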
Figure 1.8 Data mining functions and application areas
Benefits of data mining and data warehouse to organization
Data warehousing and data mining can be the key differentiator in many different organizations. At present, some of the most popular data warehouse applications are sales and marketing analysis across all industries, inventory turn and product tracking in manufacturing, and category management, vendor analysis, and marketing program effectiveness analysis in retail. Both data warehousing and data mining help organizations obtain better information and classify it, which contributes to profit and efficiency.