The Importance of Data Mining and Data Warehousing in Database Management Systems




A database is an electronic store of data. The basic terms used to describe the structure of a database are entity, data, attribute, entity set and relationship. A database can also be defined as a special kind of software application whose main purpose is to help people and programs store, organize and retrieve information. An entity is a person, event, place or item. The facts that describe an entity are known as data. Each characteristic that describes an entity is known as an attribute. An entity set is a collection of related entities, and each entity set is given a singular name. A database is a collection of entity sets. The entities in a database are likely to interact with other entities, and these interactions between entity sets are called relationships. A relationship may be one-to-one, one-to-many or many-to-many.
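The terms above can be made concrete with a small sketch. It assumes Python's built-in sqlite3 module and two hypothetical entity sets, "customer" and "purchase", that are not part of the original discussion. Each table is an entity set, each column is an attribute, and the foreign key expresses a one-to-many relationship (one customer, many purchases).

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (              -- entity set
        customer_id INTEGER PRIMARY KEY, -- attribute
        name        TEXT NOT NULL        -- attribute
    );
    CREATE TABLE purchase (              -- second entity set
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        item        TEXT NOT NULL
    );
""")

# One customer entity related to many purchase entities (one-to-many).
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?, ?)",
    [(10, 1, "printer"), (11, 1, "laptop")],
)

# Follow the relationship from purchases back to the customer.
rows = conn.execute("""
    SELECT c.name, p.item
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.customer_id
    ORDER BY p.purchase_id
""").fetchall()
print(rows)
```

A many-to-many relationship would need a third, linking table that holds one row per related pair.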


In summary, a database management system (DBMS) environment consists of several elements. The DBMS software package is a product such as Microsoft Access, Oracle, SQL Server or Visual FoxPro. The user-developed and implemented database or databases include a data dictionary and other database objects. Custom applications include data-entry forms, queries, blocks and programs. Hardware includes personal computers, minicomputers and mainframes in a network environment. Software includes the operating system and the network system. These elements of a DBMS are mapped in Figure 1.

What is Data Mining?

In a survey of data mining and data warehousing by Mento, B. and Rapple, B. (2003), 40% of respondents described data mining as a technology used by their institution. In the same survey, library respondents believed that data mining could be a valuable tool for assisting library users with future technologies. Research at other institutions likewise concluded that large repositories of full-text and numeric data would offer data mining opportunities that would benefit from the expertise found in libraries. The authors also cite a definition from the First International Conference on Knowledge Discovery and Data Mining: "data mining is the process of selection, exploration, and modelling of large quantities of data to discover regularities or relations that are at first unknown with the aim of obtaining clear and useful results for the owner of the database".

According to Kantardzic, M. (2003), the verb "to mine" refers to mining operations that extract hidden resources from the Earth, while from the point of view of scientific research, data mining is a relatively new discipline that has developed mainly from studies carried out in other disciplines. Statisticians have referred to data mining as "data fishing", "data dredging" or "data snooping". The aim of data mining is to examine databases for regularities that may lead to a better understanding of the domain described by the database. A database is an organised and typically large collection of detailed facts concerning some domain in the world. Other authors define data mining as an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an interesting outcome. It is the search for new, valuable and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers: the best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers.

What is a Data Warehouse?

A data warehouse is defined as a collection of integrated, subject-oriented databases designed to support the decision-support function (DSF), in which each unit of data is relevant to some moment in time. A data warehouse means different things to different people: some definitions are limited to data, while others refer to people, processes, software, tools and data. One of its functions is to store the historical data of an organization in an integrated manner that reflects the various facets of the organization and its business. A data warehouse can be viewed as an organization's repository of data, set up to support strategic decision-making. Data in a data warehouse is not updated. It is used only to respond to queries from end users, who are generally decision-makers. Two aspects of a data warehouse are important: the specific types (classification) of data it stores, and the set of transformations used to prepare the data in a final form that is useful for decision making.


Data mining is best understood as a process that relies on the notion of matching a problem to a technique. It is not simply a collection of isolated tools, each completely different from the others and waiting to be matched to a problem. Han, Jiawei (2006) describes a general experimental procedure adapted to data mining problems, which involves the following steps:

1. State the problem and formulate a hypothesis: the modeller usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. This first step requires combined expertise in the application domain and in data mining models.

2. Collect the data: this step concerns how the data are generated. In the first approach, a designed experiment, the data-generation process is under the control of the modeller. In the second, the observational approach, the modeller cannot influence data generation. Most data mining applications assume the observational setting, namely random data generation.

3. Preprocess the data: preprocessing usually includes at least two common tasks, namely outlier detection, and scaling, encoding and selecting features. A good preprocessing method provides an optimal representation for a data mining technique by incorporating a priori knowledge in the form of application-specific encoding and scaling.

4. Estimate the model: the main task in this step is the selection and implementation of an appropriate data mining technique. The process is not straightforward. In practice it is based on several models, and selecting the best one is an additional task.

5. Interpret the model and draw conclusions: models need to be interpretable in order to be useful, but the goals of model accuracy and accuracy of interpretation are somewhat contradictory. Simple models are more interpretable but also less accurate. Data mining methods are expected to yield highly accurate results using high-dimensional models. A good understanding of the whole process is important for any successful application. The whole process is summarized in the accompanying figure.
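Steps 3 and 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are invented, outliers are detected with a median-absolute-deviation rule, scaling is min-max, and the two candidate models (a constant and a least-squares line) are chosen purely for brevity. None of these choices comes from the cited procedure.

```python
from statistics import mean, median

def remove_outliers(pairs, k=3.0):
    """Keep (x, y) pairs whose y lies within k median absolute deviations of the median."""
    ys = [y for _, y in pairs]
    center = median(ys)
    mad = median([abs(y - center) for y in ys])
    if mad == 0:
        return list(pairs)  # no spread to measure against, so keep everything
    return [(x, y) for x, y in pairs if abs(y - center) <= k * mad]

def min_max_scale(values):
    """Rescale values to the range [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on the data."""
    return mean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)])

# Hypothetical observations; the last one is an obvious outlier.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8), (6, 50.0)]

clean = remove_outliers(data)                     # step 3: outlier detection
xs = min_max_scale([x for x, _ in clean])         # step 3: scaling
ys = [y for _, y in clean]

slope, intercept = fit_line(xs, ys)               # step 4: candidate model A
candidates = {
    "constant": lambda x: mean(ys),               # step 4: candidate model B
    "line": lambda x: slope * x + intercept,
}
errors = {name: mse(f, xs, ys) for name, f in candidates.items()}
best = min(errors, key=errors.get)                # step 4: select the best model
print(best, round(slope, 2), round(intercept, 2))
```

Evaluating on the same data used for fitting favours the more flexible model, so a realistic comparison would hold out separate test data.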

A data warehouse is not a prerequisite for data mining, but in practice data mining, especially for some large companies, is made easier by having access to a data warehouse. The primary goal of a data warehouse is to increase the "intelligence" involved in the decision-making process and the knowledge of those who take part in it. A data warehouse is huge and may store billions of records. Two important aspects of its design process should be understood: the first is the specific types (classification) of data stored in a data warehouse, and the second is the set of transformations used to prepare the data in its final form. The classification is accommodated to time-dependent data sources, and the categories of data in a data warehouse are old detail data, current detail data, lightly summarized data, highly summarized data and metadata.

There are four main types of transformation, each with its own characteristics:

1. Simple transformations: manipulation of data that is focused on one field at a time, without taking into account its values in related fields.

2. Cleaning and scrubbing: transformations that ensure consistent formatting of a field, for example the proper formatting of address information. They also include checks for valid values in a particular field, usually by checking the range or choosing from an enumerated list.

3. Integration: the process of taking operational data from one or more sources and mapping it, field by field, onto a new data structure in the data warehouse. A difficult situation occurs when there are multiple system sources for the same entities and there is no clear way to identify those entities as the same.

4. Aggregation and summarization: methods of condensing instances of data found in the operational environment into fewer instances in the warehouse environment. Summarization is a simple addition of values along one or more data dimensions, while aggregation refers to the addition of different business elements into a common total and is highly domain-dependent.
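As a rough illustration of three of these transformation types, the sketch below applies a simple transformation, a cleaning check and a summarization to a handful of invented operational records. The field names, the list of valid regions and the range rule are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical enumerated list of valid values for the "region" field.
VALID_REGIONS = {"north", "south"}

# Invented operational records, as they might arrive from a source system.
raw = [
    {"region": " North ", "month": "2024-01", "amount": "120.50"},
    {"region": "south", "month": "2024-01", "amount": "80"},
    {"region": "north", "month": "2024-02", "amount": "-5"},  # fails the range check
    {"region": "west", "month": "2024-02", "amount": "40"},   # not in the enumerated list
    {"region": "NORTH", "month": "2024-01", "amount": "30.25"},
]

def simple_transform(rec):
    """Simple transformation: each field is converted on its own."""
    return {
        "region": rec["region"].strip().lower(),
        "month": rec["month"],
        "amount": float(rec["amount"]),
    }

def is_clean(rec):
    """Cleaning: check the enumerated list and the valid range."""
    return rec["region"] in VALID_REGIONS and rec["amount"] >= 0

def summarize(records):
    """Summarization: add up amounts along the region and month dimensions."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["region"], rec["month"])] += rec["amount"]
    return dict(totals)

clean = [rec for rec in map(simple_transform, raw) if is_clean(rec)]
totals = summarize(clean)
print(totals)
```

Integration is left out because it depends on how the source systems identify the same entity, which is the hard part described in item 3.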

A data warehouse can be a point solution used to satisfy a specific need, or a common data resource shared by a number of functional groups. Although a point solution looks easier to implement with minimal data modeling effort, a data warehouse has to be faithful to the data meanings embedded in its sources. A data warehouse also requires a substantial investment of time and funding. The basic elements of a data warehouse are operational source systems, the data staging area, the data presentation area and data access tools.

1. Operational source systems are the operational systems of record that capture the transactions of the business. The main priorities of source systems are processing performance and availability. Each source system handles common data such as customer, geography, product or calendar that also appears in other operational systems in the organization.

2. The data staging area is both a storage area and a set of processes generally referred to as extract-transformation-load (ETL). It comprises everything between the operational source systems and the data presentation area. The key architectural requirement for the data staging area is that it is off-limits to business users and does not provide query and presentation services.

3. The data presentation area is where data is organized, stored and made available for direct querying by users, report writers and other analytical applications. Data in the queryable presentation area must be dimensional, must be atomic, and must adhere to the data warehouse bus architecture.

4. Data access tools are the final major component of the data warehouse environment. They allow business users to leverage the presentation area for analytic decision making. A data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modelling application.

The characteristics of a data warehouse can be summarized in a three-stage data-warehousing development process that comprises modeling, building and deploying. Modeling, in simple terms, means taking the time to understand the business processes, the information requirements of these processes and the decisions that are currently made within them. Building is the stage in which requirements are established for tools that suit the types of decision support necessary for the targeted business process. It also involves creating a data model that helps further define information requirements, and decomposing problems into data specifications and the actual data store, which will, in its final form, represent either a data mart or a more comprehensive data warehouse. Deploying means implementing, relatively early in the overall process, the nature of the data to be warehoused and the various business intelligence tools to be employed, beginning with the training of users.

Data in a data warehouse can be used for many different purposes, including sitting and waiting for future requirements that are unknown today. A data warehouse is oriented to the major subject areas of the corporation that have been defined in the high-level corporate data model, including account, customer, product, transaction or activity, and policy.


Data mining and data warehousing are used for data analysis applications in areas such as finance, retailing and web services. Data mining supports knowledge discovery, a process that consists of data cleaning, data transformation, data integration, data mining, evaluation and presentation, and it involves several techniques. The first is association analysis, the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association rules are commonly used for prediction.
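A small sketch may clarify what "occur frequently together" means. It uses the two standard measures for an association rule, support and confidence, on four invented shopping transactions. The thresholds and the restriction to rules with one item on each side are simplifying assumptions, not a full rule-mining algorithm.

```python
from itertools import combinations

# Invented transactions; each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def rules(min_support=0.5, min_confidence=0.6):
    """List single-item rules lhs -> rhs that meet both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs})
            c = confidence({lhs}, {rhs})
            if s >= min_support and c >= min_confidence:
                found.append((lhs, rhs, s, c))
    return found

for lhs, rhs, s, c in rules():
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

Here the rule "butter -> bread" has confidence 1.0, because every transaction that contains butter also contains bread.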

Another technique is classification and prediction, two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. For example, a prediction model can be built to predict the expenditures of potential customers on computer equipment given their income and occupation. Classification finds a set of models that describe and distinguish the different classes of objects. A model can then be used to predict the class of an object whose class is unknown.
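The income-and-occupation example can be sketched with a k-nearest-neighbour classifier, one of many possible classification techniques. The customer records, the class labels "low" and "high", and the distance function below are all invented for illustration.

```python
from collections import Counter

# Hypothetical training data: (income in thousands, occupation, spending class).
training = [
    (30, "clerk", "low"),
    (35, "clerk", "low"),
    (40, "teacher", "low"),
    (80, "engineer", "high"),
    (90, "engineer", "high"),
    (85, "teacher", "high"),
]

def distance(a, b):
    """Assumed mixed distance: income gap in tens of thousands plus 1 if occupations differ."""
    income_gap = abs(a[0] - b[0]) / 10
    occupation_gap = 0 if a[1] == b[1] else 1
    return income_gap + occupation_gap

def predict(customer, k=3):
    """Predict the class of an unseen customer by majority vote of the k nearest neighbours."""
    nearest = sorted(training, key=lambda row: distance(customer, row))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

print(predict((82, "engineer")))
print(predict((33, "teacher")))
```

Predicting an actual expenditure amount rather than a class would be a regression task, for which the same neighbours could be averaged instead of counted.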


A further technique is clustering, which involves grouping objects so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. It is based on the principle of maximizing intraclass similarity and minimizing interclass similarity. Cluster analysis has been studied extensively for many years. These techniques have been built into statistical analysis packages.
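One widely used clustering method is k-means, sketched below as an example of the principle just described. The points are invented, and the choice of the first k points as starting centroids is a naive assumption that keeps the example deterministic.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids

# Invented two-dimensional objects forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Objects that share a label are close to each other and far from the other group, which is the high intraclass and low interclass similarity that clustering aims for.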

The last technique is outlier analysis. A database may contain data objects that do not comply with the general model or behaviour of the data, and these objects are called outliers. Outlier analysis is useful for applications such as fraud detection and network intrusion detection. There are two types of approach: statistical-based outlier detection and distance-based outlier detection.
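Both approaches can be illustrated on a short list of invented transaction amounts. The statistical version assumes a z-score rule with a threshold of 2, and the distance-based version assumes that an object is an outlier when at least 90% of the other objects lie further away than a chosen radius. These thresholds are assumptions for the example only.

```python
from statistics import mean, pstdev

def statistical_outliers(values, threshold=2.0):
    """Flag values whose z-score (distance from the mean in standard deviations) exceeds threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, radius, fraction=0.9):
    """Flag values for which at least `fraction` of the other values lie further than `radius` away."""
    flagged = []
    for i, v in enumerate(values):
        far = sum(abs(v - w) > radius for j, w in enumerate(values) if j != i)
        if far / (len(values) - 1) >= fraction:
            flagged.append(v)
    return flagged

# Invented transaction amounts; 95 is far from the rest.
amounts = [10, 11, 9, 10, 12, 10, 11, 95]
print(statistical_outliers(amounts))
print(distance_outliers(amounts, radius=5))
```

On small samples the mean and standard deviation are themselves distorted by the outlier, which is one reason distance-based methods are often considered alongside statistical ones.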


The benefits of a data warehouse can be summarized as follows:

* Supports strategic decision making: by providing summary and detail data that can be used for trend analysis, statistical analysis, performance measurement comparisons, correlation among disparate facts and other similar requirements.

* Supports an integrated business value chain: by providing a single source of authoritative, accurate, consistent and timely data that cuts across traditional departmental applications, where opportunities exist to provide consistently defined data and reduce redundant effort.

* Empowers the workforce: access to data empowers business users and improves analysis capabilities. It enables users to be more self-sufficient and reduces the dependence on time-consuming specialized report development. It also enables organizational streamlining through the simplified data flows made possible by better access to shared data.

* Speeds up response time to business queries: it enables faster responses to business questions. Response time for data retrieval can be reduced from days to minutes.

* Improves data quality: a consolidated data store eliminates the reconciliation of inconsistent data. During the analysis and transformation of source data on its way to the data warehouse, data quality improvements can be made. The best data in a company is the record of how much money someone else owes the company. Data quality is also a driver of business reengineering: frequently a data element would be interesting if it were of high quality, but it either is not collected at all or is optional.

* Documents organizational knowledge: a well-documented and centralized data store reduces the organizational vulnerability caused by concentrating analysis expertise and the understanding of data in a few staff members with institutional knowledge.

* Streamlines the systems portfolio: it helps streamline systems by removing decision support functions and moving historical data out of operational systems into the data warehouse. It can help to address legacy system deficiencies and support the transition to a new client/server platform.

Mento, B. & Rapple, B. (2003) surveyed data mining and data warehouse operations from a library perspective. The survey also relates the benefits to factors such as staffing, training and budget. It reports that three libraries have developed a data warehouse of social science data to enhance users' learning and research. Another library stated that its data mining operations have spawned new research. Some libraries mentioned benefits in the administrative sphere from their data mining and data warehousing operations. Another mentioned that Web log data mining can point to areas where users might benefit from instruction in using particular search tools.

Another respondent pointed to their library's custom-created software and its crawlers/classifiers, which greatly improve the gathering and subsequent evaluation of relevant, high-quality Internet resources. Data mining also helped in making better serial cancellation, budget, workflow, collection development, collection weeding, OPAC design and Web development decisions. It also helps in evaluating databases and other resources, determining user needs, monitoring system performance and usability, developing forecasts, making policies and improving Web security. Libraries that use data mining do so primarily for administrative purposes, such as facilitating the collection and analysis of acquisition, web usage, circulation and other diverse patron data. In conclusion, the researchers highlight a growing participation by libraries in creating such data warehouses. Libraries are taking a leadership role in creating and managing data warehouses for both research and administrative purposes. The survey also shows that librarians recognize data mining techniques as offering new approaches to analyzing content and discovering knowledge within large databases and the Web. Moreover, the widespread availability of data mining software provides a new avenue for libraries to explore data mining's potential in both academic research and decision-making.


In conclusion, data mining represents one of the major applications for data warehousing, since the function of a data warehouse is to provide information to end users for decision support. The data mining process gives end users the capacity to extract hidden, nontrivial information, which is one reason why a data warehouse serves as a source of data for the data mining process. Data mining is one of the fastest growing fields in the computer industry. Its strength is reflected in its wide range of methodologies and techniques, which can be applied to a host of problem sets. Because data mining is a natural activity to perform on large data sets, one of its largest target markets is the data warehousing and decision support community and the professionals it encompasses.

Data mining can also be applied in a variety of fields, and its techniques can be applied to problems of business process reengineering, in which the goal is to understand the interactions and relationships among business practices and organizations. It is an important method for extracting information from data sets of all sizes, small and large. The factors that make developing a data mining model a potentially lengthy process are the gigantic amount of data that must be processed when mining, the intricacies of testing and validating models, the sampling of massive databases, and the large number of models that must be built to explore complex databases.