Concept of data mining and warehouse

Abstract: Data mining can discover information hidden within valuable data assets. Knowledge discovery, using advanced information technologies, can uncover veins of surprising, golden insights in a mountain of factual data. Data mining consists of a panoply of powerful tools which are intuitive, easy to explain, understandable, and simple to use. These advanced information technologies include artificial intelligence methods (e.g. expert systems, fuzzy logic, etc.), decision trees, rule induction methods, genetic algorithms and genetic programming, neural networks (e.g. backpropagation, associative memories, etc.), and clustering techniques. The synergy created between data warehousing and data mining allows knowledge seekers to leverage their massive data assets, thus improving the quality and effectiveness of their decisions. The growing requirements for data mining and real-time analysis of information will be a driving force in the development of new data warehouse architectures and methods and, conversely, in the development of new data mining methods and applications.

Keywords: Computer software, Data mining, Data structuring, Knowledge-based systems


Data mining is concerned with discovering new, meaningful information, so that decision makers can learn as much as they can from their valuable data assets. Using advanced information technologies, knowledge discovery in databases can uncover veins of surprising, golden insights in a mountain of factual data. Data warehousing is a methodology that combines and coordinates many sets of diversified data into a unified and consistent body of useful information. In larger organizations, many different types of users with varied needs must use the same massive data warehouse to retrieve the pieces of information that best suit their unique requirements.


Data mining can be defined as the process of exploring and analyzing large volumes of data in order to discover interesting and hidden patterns, rules and relationships within the data. In a business setting, the purpose of data mining is to allow a corporation to improve its marketing, sales and customer support operations through a better understanding of its customers. Large corporations use data mining to locate high-value customers, to enhance their product offerings in order to increase sales, and to minimize losses due to error or fraud.


Data mining is a component of a wider process called "knowledge discovery from databases". It involves scientists and statisticians, as well as those working in other fields such as machine learning, artificial intelligence, information retrieval and pattern recognition.

Before a data set can be mined, it first has to be "cleaned". This cleaning process removes errors, ensures consistency and takes missing values into account. Next, computer algorithms are used to "mine" the clean data looking for unusual patterns. Finally, the patterns are interpreted to produce new knowledge.
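The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (customer_id, age, marital_status) and the rule that ages outside 0 to 120 count as errors are all hypothetical and are not taken from any particular system.

```python
# Hypothetical raw records; empty strings stand in for missing values.
raw_records = [
    {"customer_id": 1, "age": "34", "marital_status": "Married "},
    {"customer_id": 2, "age": "", "marital_status": "single"},
    {"customer_id": 3, "age": "-5", "marital_status": "SINGLE"},
    {"customer_id": 1, "age": "34", "marital_status": "Married "},  # duplicate row
]


def parse_age(text):
    """Return the age as an int, or None if it is missing or implausible."""
    try:
        age = int(text.strip())
    except ValueError:
        return None  # missing or non-numeric value
    # Assumed plausibility range; out-of-range values are treated as missing.
    return age if 0 <= age <= 120 else None


def clean(records):
    """Drop duplicate customers, remove erroneous ages, normalize text fields."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["customer_id"] in seen:
            continue  # keep only the first row per customer
        seen.add(rec["customer_id"])
        cleaned.append({
            "customer_id": rec["customer_id"],
            "age": parse_age(rec["age"]),
            "marital_status": rec["marital_status"].strip().lower(),
        })
    return cleaned


cleaned_records = clean(raw_records)
```

Missing values are kept as None rather than filled in, so that the later mining step can decide how to take them into account.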

The following example illustrates how data mining can help bankers enhance their business. Records of the bank's customers collected over the years, including information such as age, sex, marital status, occupation and number of children, are used in the mining process. First, an algorithm is used to identify characteristics that distinguish customers who took out a particular kind of loan from those who did not. Eventually, it develops "rules" by which it can identify customers who are likely to be good candidates for such a loan. These rules are then used to identify such customers in the remainder of the database. Next, another algorithm is used to sort the database into clusters, or groups of people with many similar attributes, in the hope that these might reveal interesting and unusual patterns. Finally, the patterns revealed by these clusters are interpreted by the data miners, in collaboration with bank personnel.
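The two mining steps in this example can be sketched roughly as follows. The customer records, the field names and the choice of algorithms are assumptions made for illustration: the essay does not say which algorithms the bank would use, so a very simple single-attribute rule search and a plain k-means clustering stand in for them here.

```python
from collections import defaultdict

# Hypothetical customer records; all values are made up for illustration.
customers = [
    {"age": 25, "marital_status": "single", "children": 0, "took_loan": False},
    {"age": 31, "marital_status": "married", "children": 2, "took_loan": True},
    {"age": 45, "marital_status": "married", "children": 3, "took_loan": True},
    {"age": 52, "marital_status": "single", "children": 0, "took_loan": False},
    {"age": 38, "marital_status": "married", "children": 1, "took_loan": True},
    {"age": 23, "marital_status": "single", "children": 0, "took_loan": False},
]


def best_rule(records, attributes, target):
    """Return (attribute, value, rate) for the value with the highest target rate."""
    best = None
    for attr in attributes:
        counts = defaultdict(lambda: [0, 0])  # value -> [positives, total]
        for rec in records:
            counts[rec[attr]][0] += rec[target]
            counts[rec[attr]][1] += 1
        for value, (positives, total) in counts.items():
            rate = positives / total
            if best is None or rate > best[2]:
                best = (attr, value, rate)
    return best


def k_means(points, centers, iterations=10):
    """Plain k-means on numeric tuples, starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[distances.index(min(distances))].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return groups, centers


# Step 1: find the characteristic that best separates loan takers.
rule = best_rule(customers, ["marital_status", "children"], "took_loan")

# Step 2: group customers with similar numeric attributes.
points = [(c["age"], c["children"]) for c in customers]
clusters, centers = k_means(points, centers=[points[0], points[2]])
```

On this toy data the rule search picks out marital status as the distinguishing characteristic. Interpreting what the resulting clusters mean is still left to the data miners and bank personnel, as the example describes.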


A data warehouse is a subject-oriented, integrated, historical and summarized collection of data in support of management's decision making.

Subject oriented

It stores subject-oriented information such as customers, products and students rather than the application areas such as customer invoicing, inventory and student management.


Integrated

It is the consolidation and integration of corporate application-oriented data from multiple sources. The integrated data must be made consistent in order to present a unified view of the data to the users.


Historical

Data warehouse data is historical. It represents snapshots over time. Because it is historical, the data is read-only.


Summarized

Data in a data warehouse is often summarized to an appropriate level of detail.

A data warehouse provides information to assist companies in decision making. Companies can use the valuable information in a data warehouse to identify trends. Data warehousing is a process that can:

  • Retrieve data from the source systems
  • Transform data into a useful format to place into the data warehouse
  • Manage the database
  • Use tools for building and managing the data warehouse
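The retrieve, transform and load steps listed above can be sketched with Python's built-in sqlite3 module. The table names (sales, daily_sales), the columns and the sample rows are hypothetical; two in-memory databases stand in for a real source system and a real warehouse.

```python
import sqlite3

# Hypothetical source system: an operational table of raw sales rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (sold_on TEXT, product TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-05", " widget ", 1250),
        ("2024-01-05", "Widget", 750),
        ("2024-01-06", "gadget", 3000),
    ],
)

# Hypothetical warehouse table holding daily totals per product.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_sales (sold_on TEXT, product TEXT, total_amount REAL)"
)


def extract(conn):
    """Retrieve the raw rows from the source system."""
    return conn.execute("SELECT sold_on, product, amount_cents FROM sales").fetchall()


def transform(rows):
    """Normalize product names and summarize amounts per day and product."""
    totals = {}
    for sold_on, product, cents in rows:
        key = (sold_on, product.strip().lower())
        totals[key] = totals.get(key, 0) + cents
    return [(day, name, cents / 100) for (day, name), cents in sorted(totals.items())]


def load(conn, rows):
    """Place the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT sold_on, product, total_amount FROM daily_sales ORDER BY sold_on, product"
).fetchall()
```

The transform step is where inconsistent source values (here, differently spelled product names) are reconciled, which is what gives the warehouse its unified view.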


Organizations that wish to use data mining tools can purchase mining programs designed for existing software and hardware platforms, which can be integrated into new products and systems as they are brought online, or they can build their own custom mining solution. For instance, feeding the output of a data mining exercise into another computer system, such as a neural network, is quite common and can give the mined data more value. This is because the data mining tool gathers the data, while the second program (e.g., the neural network) makes decisions based on the data collected.

Different types of data mining tools are available in the marketplace, each with its own strengths and weaknesses. Internal auditors need to be aware of the different kinds of data mining tools available and recommend the purchase of a tool that matches the organization's current detective needs. This should be considered as early as possible in the project's lifecycle, perhaps even in the feasibility study.

Most data mining tools can be classified into one of three categories: traditional data mining tools, dashboards, and text-mining tools. Below is a description of each.

  • Traditional Data Mining Tools. Traditional data mining programs help companies establish data patterns and trends by using a number of complex algorithms and techniques. Some of these tools are installed on the desktop to monitor the data and highlight trends and others capture information residing outside a database. The majority are available in both Windows and UNIX versions, although some specialize in one operating system only. In addition, while some may concentrate on one database type, most will be able to handle any data using online analytical processing or a similar technology.
  • Dashboards. Installed in computers to monitor information in a database, dashboards reflect data changes and updates onscreen - often in the form of a chart or table - enabling the user to see how the business is performing. Historical data also can be referenced, enabling the user to see where things have changed (e.g., increase in sales from the same period last year). This functionality makes dashboards easy to use and particularly appealing to managers who wish to have an overview of the company's performance.
  • Text-mining Tools. The third type of data mining tool is sometimes called a text-mining tool because of its ability to mine data from different kinds of text, from Microsoft Word and Acrobat PDF documents to simple text files, for example. These tools scan content and convert the selected data into a format that is compatible with the tool's database, thus providing users with an easy and convenient way of accessing data without the need to open different applications. Scanned content can be unstructured (i.e., information is scattered almost randomly across the document, including e-mails, Internet pages, audio and video data) or structured (i.e., the data's form and purpose are known, such as content found in a database). Capturing these inputs can provide organizations with a wealth of information that can be mined to discover trends, concepts, and attitudes.
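As a rough idea of what a text-mining tool does with unstructured content, the sketch below counts how often terms occur across a few short documents. The sample texts and the small stop-word list are invented for illustration, and real tools do far more (format conversion, concept extraction and so on).

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of", "but", "very"}


def term_frequencies(documents):
    """Count non-stop-word terms across all documents, ignoring case."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Hypothetical customer e-mails.
documents = [
    "The delivery was late and the packaging was damaged.",
    "Late delivery again, but support was very helpful.",
    "Helpful support. Delivery is still late.",
]

frequencies = term_frequencies(documents)
top_terms = frequencies.most_common(3)
```

Terms that recur across many documents (here, late deliveries) are the kind of trend such a tool would surface for further review.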

Besides these tools, other applications and programs may be used for data mining purposes. For instance, audit interrogation tools can be used to highlight fraud, data anomalies, and patterns. In addition, internal auditors can use spreadsheets to undertake simple data mining exercises or to produce summary tables. Data from some of the desktop, notebook, and server computers that run operating systems such as Windows, Linux, and Macintosh can be imported directly into Microsoft Excel. Using pivot tables in the spreadsheet, auditors can review complex data in a simplified format and drill down where necessary to find the underlying assumptions or information.
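The pivot-table idea, summarizing first and drilling down afterwards, does not depend on a spreadsheet. The sketch below shows it in plain Python; the expense-claim rows and the field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical expense-claim rows an auditor might export from a ledger.
claims = [
    {"department": "Sales", "quarter": "Q1", "amount": 1200},
    {"department": "Sales", "quarter": "Q1", "amount": 300},
    {"department": "Sales", "quarter": "Q2", "amount": 900},
    {"department": "IT", "quarter": "Q1", "amount": 400},
    {"department": "IT", "quarter": "Q2", "amount": 4100},
]


def pivot(rows, row_key, col_key, value_key):
    """Sum value_key for every (row_key, col_key) combination."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return {row: dict(cols) for row, cols in table.items()}


def drill_down(rows, **filters):
    """Return the detail rows behind one cell of the summary."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]


summary = pivot(claims, "department", "quarter", "amount")
it_q2_detail = drill_down(claims, department="IT", quarter="Q2")
```

An auditor who notices the unusually large IT total for Q2 in the summary can then use the drill-down to see the individual claims behind it.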

When evaluating data mining strategies, companies may decide to acquire several tools for specific purposes, rather than purchasing one tool that meets all needs. Although acquiring several tools is not a mainstream approach, a company may choose to do so if, for example, it installs a dashboard to keep managers informed on business matters, a full data-mining suite to capture and build data for its marketing and sales arms, and an interrogation tool so auditors can identify fraudulent activity.


In addition to using a particular data mining tool, internal auditors can choose from a variety of data mining techniques. The most commonly used techniques include artificial neural networks, decision trees, and the nearest-neighbor method. Each of these techniques analyzes data in different ways:

  • Artificial neural networks are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of the power comes at the expense of ease of use and deployment. One area where auditors can easily use them is when reviewing records to identify fraud and fraud-like actions. Because of their complexity, they are better employed in situations where they can be used and reused, such as reviewing credit card transactions every month to check for anomalies.
  • Decision trees are tree-shaped structures that represent decision sets. These decisions generate rules, which then are used to classify data. Decision trees are the favored technique for building understandable models. Auditors can use them to assess, for example, whether the organization is using an appropriate cost-effective marketing strategy that is based on the assigned value of the customer, such as profit.
  • The nearest-neighbor method classifies dataset records based on similar data in a historical dataset. Auditors can use this approach to define a document that is interesting to them and ask the system to search for similar items.
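Of the three techniques, the nearest-neighbor method is the easiest to show in a few lines. The sketch below is a minimal k-nearest-neighbor classifier; the historical transactions, their labels and the choice of k = 3 are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical historical transactions: (amount, hour of day) with a known label.
history = [
    ((20.0, 13), "normal"),
    ((35.0, 18), "normal"),
    ((15.0, 10), "normal"),
    ((900.0, 3), "suspicious"),
    ((750.0, 2), "suspicious"),
    ((820.0, 4), "suspicious"),
]


def classify(record, labelled, k=3):
    """Label a record by majority vote among its k closest historical records."""
    nearest = sorted(labelled, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = classify((800.0, 1), history)
```

Note that the features are not scaled here, so the amount dominates the distance. A real application would normally rescale the attributes first.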

Each of these approaches brings different advantages and disadvantages that need to be considered prior to their use. Neural networks, which are difficult to implement, require all inputs and resulting outputs to be expressed numerically, so some interpretation is needed depending on the nature of the data-mining exercise. The decision tree technique is the most commonly used methodology, because it is simple and straightforward to implement. Finally, the nearest-neighbor method relies more on linking similar items and, therefore, works better for extrapolation than for predictive enquiries.

A good way to apply advanced data mining techniques is to have a flexible and interactive data mining tool that is fully integrated with a database or data warehouse. Using a tool that operates outside of the database or data warehouse is not as efficient. Using such a tool will involve extra steps to extract, import, and analyze the data. When a data mining tool is integrated with the data warehouse, it simplifies the application and implementation of mining results. Furthermore, as the warehouse grows with new decisions and results, the organization can mine best practices continually and apply them to future decisions.

Regardless of the technique used, the real value behind data mining is modeling - the process of building a model based on user-specified criteria from already captured data. Once a model is built, it can be used in similar situations where an answer is not known. For example, an organization looking to acquire new customers can create a model of its ideal customer that is based on existing data captured from people who previously purchased the product. The model then is used to query data on prospective customers to see if they match the profile. Modeling also can be used in audit departments to predict the number of auditors required to undertake an audit plan based on previous attempts and similar work.
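The customer-acquisition example can be sketched as follows. The model here is deliberately simple, a mean and spread per attribute, and is only one possible way to represent an "ideal customer". The purchaser data, the attributes (age and yearly income in thousands) and the tolerance of 1.5 standard deviations are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical people who previously purchased the product: (age, income in thousands).
purchasers = [(34, 62), (41, 75), (38, 68), (45, 80), (36, 65)]


def build_profile(rows):
    """Model the ideal customer as a (mean, standard deviation) pair per attribute."""
    columns = zip(*rows)
    return [(mean(col), pstdev(col)) for col in columns]


def matches(profile, prospect, tolerance=1.5):
    """A prospect matches if every attribute is within tolerance standard deviations."""
    return all(
        abs(value - mu) <= tolerance * sigma
        for value, (mu, sigma) in zip(prospect, profile)
    )


profile = build_profile(purchasers)

# Query data on prospective customers to see whether they fit the profile.
prospects = {"A": (39, 70), "B": (22, 30), "C": (44, 79)}
matched = [name for name, p in prospects.items() if matches(profile, p)]
```

The model is built once from already captured data and then reused on records where the answer (will this person buy?) is not yet known.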


Benefits of Data Mining

Organizations' point of view

Data mining is very important to businesses because it helps to enhance their overall operations and discover new patterns that may allow companies to serve their customers better. Through data mining, financial and insurance companies are able to detect patterns of fraudulent credit card usage, identify behavior patterns of risky customers, and analyze claims. Data mining also helps these companies minimize their risk and increase their profits. Since companies are able to minimize their risk, they may be able to charge customers a lower interest rate or a lower premium. Companies say that data mining is beneficial to everyone because some of the benefits they obtain through data mining will be passed on to consumers.

Data mining allows marketing companies to target their customers more effectively and can therefore reduce their need for mass advertising. As a result, the companies can pass their savings on to consumers. According to Michael Turner, an executive director at the Direct Marketing Association, "Detailed consumer information lets apparel retailers market their products to consumers with more precision. But if privacy rules impose restrictions and barriers to data collection, those limitations could increase the prices consumers pay when they buy from catalog or online apparel retailers by 3.5% to 11%".

When it comes to privacy issues, organizations say that they are doing everything they can to protect their customers' personal information. In addition, they say that they only use consumer data for ethical purposes such as marketing and detecting credit card fraud. To ensure that personal information is used in an ethical way, CIO Magazine has put together a list of what it calls the Six Commandments of Ethical Data Management. The six commandments include: "1) data is a valuable corporate asset and should be managed as such, like cash, facilities or any other corporate asset; 2) the CIO is steward of corporate data and is responsible for managing it over its life cycle (from its generation to its appropriate destruction); 3) the CIO is responsible for controlling access to and use of data, as determined by governmental regulation and corporate policy; 4) the CIO is responsible for preventing inappropriate destruction of data; 5) the CIO is responsible for bringing technological knowledge to the development of data management practices and policies; 6) the CIO should partner with executive peers to develop and execute the organization's data management policies."

Since data mining is not a perfect process, mistakes such as mismatched information will occur. Companies and organizations are aware of this issue and try to deal with it. According to Agrawal, an IBM researcher, data obtained through mining is associated with only a 5 to 10 percent loss in accuracy. With continuous improvement in data mining techniques, this inaccuracy is expected to decrease significantly.

Benefits of Data Warehouse

There are a large number of obvious advantages involved with using a data warehouse. As the name suggests, a data warehouse is a computerized warehouse in which information is stored.

The organization that owns this information can analyze it in order to find historical patterns or connections that allow it to make important business decisions. In this article I will go over some of the advantages and disadvantages that are connected to data warehouses.

One of the main advantages of using a data warehouse is that users are able to access a large amount of information. This information can be used to solve a large number of problems, and it can also be used to increase the profits of a company. Not only do users have access to a large amount of information, but this data is also consistent, relevant and organized in an efficient manner. As well as helping a company increase its profits, a data warehouse can greatly reduce the cost of computing. One powerful feature of data warehouses is that data from different locations can be combined in one location.

There are a number of reasons why this is important. When data is taken from multiple sources and placed in a centralized location, an organization can analyze it in a way that may allow it to come up with different solutions than it would if it looked at the data separately. Data mining is closely connected to data warehouses, with neural networks or other computer algorithms carrying out the analysis. When data is analyzed from multiple sources, patterns and connections can be discovered which would not be found otherwise. Another advantage of data warehouses is that they can create a structure which allows changes within the stored data to be transferred back to operational systems.

However, there are a number of disadvantages that need to be mentioned as well. Before data can be stored within the warehouse, it must be extracted, cleaned and loaded. This is a process that can take a long time. There may also be issues with compatibility. For example, a new transaction system may not work with systems that are already being used. Users who will be working with the data warehouse must be trained to use it. If they are not trained properly, they may choose not to work with the data warehouse. If the data warehouse can be accessed via the internet, this could lead to a large number of security problems.

Another problem with the data warehouse is that it is difficult to maintain. Any organization that is considering using a data warehouse must decide if the benefits outweigh the costs. Once you have paid for the data warehouse, you will still need to pay for the cost of maintenance over time. The costs involved with this must always be taken into consideration.

When it comes to storing information, there are two techniques which are used. The first is called the dimensional technique. When the dimensional technique is used, information is stored within the data warehouse as facts. These facts take the form of either text or numerical information.

Data which is stored with the dimensional technique contains information which is specific to one event. The dimensional technique is useful for workers who have limited information technology skills, because it makes the data easy for them to study and understand. In addition, data warehouses that use the dimensional technique tend to operate quickly. The biggest problem with the dimensional technique is that if the company decides to change the way it conducts business, it will be difficult to change the data warehouse to support it. The second technique used for storing data is called database normalization. With this technique, the data is stored in third normal form. While adding data is easy, producing reports can be tedious.
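The difference between the two storage techniques can be made concrete with a small sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, order_line and so on) and the sample rows are hypothetical. The same sales are stored once in a dimensional layout and once in a normalized layout, and the same report is produced from both.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimensional layout: one fact table plus a denormalized product dimension.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (product_key INTEGER, sold_on TEXT, amount REAL);

-- Normalized layout (third normal form): category is held in its own table.
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
CREATE TABLE order_line (product_id INTEGER, sold_on TEXT, amount REAL);

INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
INSERT INTO fact_sales VALUES (1, '2024-01-05', 20.0), (1, '2024-01-06', 30.0),
                              (2, '2024-01-06', 12.5);

INSERT INTO category VALUES (10, 'hardware'), (20, 'books');
INSERT INTO product VALUES (1, 'widget', 10), (2, 'manual', 20);
INSERT INTO order_line VALUES (1, '2024-01-05', 20.0), (1, '2024-01-06', 30.0),
                              (2, '2024-01-06', 12.5);
""")

# Sales per category from the dimensional layout: a single join.
star_report = db.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d USING (product_key)
    GROUP BY d.category ORDER BY d.category
""").fetchall()

# The same report from the normalized layout: one join per table in the chain.
normalized_report = db.execute("""
    SELECT c.name, SUM(o.amount)
    FROM order_line o
    JOIN product p USING (product_id)
    JOIN category c USING (category_id)
    GROUP BY c.name ORDER BY c.name
""").fetchall()
```

Both queries return the same totals, but the normalized version needs an extra join for every table the category information passes through, which is one reason reporting can become tedious as the schema grows. In return, renaming a category in the normalized layout touches a single row.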


In conclusion, data mining can be beneficial for businesses, governments and society as well as for the individual person. However, the major flaw of data mining is that it increases the risk of privacy invasion. Currently, business organizations do not have sufficient security systems to protect the information that they obtain through data mining from unauthorized access, so the use of data mining should be restricted. In the future, when companies are willing to spend money to develop security systems sufficient to protect consumer data, the use of data mining may be supported.