Data warehouse and data mining
In today's competitive environment, organizations have a pressing need for technology that helps them manage their information and data. According to Intelligent Enterprises (2010), the expansion in data warehousing is being fueled by the so-called "big-data era": data are being produced by the terabyte per day, and sometimes per hour, by web sites, enterprise applications and networks.
According to Ma, Chou and Yen (2002), the data warehouse is a concept that results from the combination of two sets of needs: the need of the information systems department to manage company data in a better way, and the business requirement for a company-wide view of information. In other words, by using data warehouse technologies, users can generate reports and queries from the information contained in databases in order to make effective and efficient business decisions.
Moreover, understanding the concepts of data warehousing and data mining is important for seeing how the two can support widespread information delivery throughout the organization and how they can become a powerful tool in support of any corporate decision-making process. An organization gains an advantage if it knows how to manage, locate and retrieve its information and use it efficiently and effectively to make better decisions and move one step ahead of its competitors.
As data warehousing and data mining have become buzzwords, much research has been conducted to understand the topic. In addition, it is important for IT professionals to understand how an organization retrieves and exploits data and how business intelligence tools can help achieve organizational objectives.
2.0 Data warehouse
Many definitions of the data warehouse have been proposed in the literature on database management systems; they focus on a range of aspects of the issue and try to achieve different goals. For instance, Mattison (1996) defined it from several points of view. It has also been defined as a database that is organized to serve as a neutral data storage area, is used by data-mining and other applications, meets a specific set of business requirements and uses data that meet a predefined set of business criteria.
Another definition was offered by Inmon (1993), who defines it as "a subject oriented, integrated, time variant, non-volatile collection of data in support of management's decision making process". The author also argued that a data warehouse is the best storage location because it is well managed and centralizes strategic data. Thuraisingham (1999), in turn, defined a data warehouse as a store that assembles data from heterogeneous databases so that users' queries are answered from the content of the warehouse.
Atkinson (2001), citing Poe, Klauer and Brobst (1998), describes a data warehouse as an analytical database that is used as the foundation of a decision support system. It is designed for large volumes of read-only data and provides intuitive access to information that will be used in making decisions. It is also created as an ongoing commitment by the organization to ensure that the appropriate data are available to the appropriate end user at the appropriate time.
Singhal (2007), drawing on Chaudhuri and Dayal (1997), Inmon (1996), Kimball (1996) and Pyne (1999), describes data warehousing as a necessary technology for collecting information from distributed databases and then performing analysis. There is therefore a critical need for data analysis systems that can automatically analyze the data, summarize it and predict future trends.
From the above definitions we can see that a data warehouse is concerned with storing vast amounts of data drawn from distributed databases. Nevertheless, we must bear in mind that a data warehouse in general does not itself attempt to extract information from the data it holds. While data warehousing formats and organizes the data to support management functions, data mining attempts to extract useful information and to predict trends from the data. Thus, data warehousing and data mining are closely related: the warehouse supplies the organized data that mining queries and uses to predict trends.
2.2 Data Warehouse Architecture
1. The first stage is a warehouse database server, which is a relational database system. Data from operational databases and other external sources are extracted, transformed and loaded into the database server. According to Ma, Chou and Yen (2000), and as discussed by Brown (1995), this process is also called "data extraction". It consists of finding the source data, extracting the data and preparing them to be loaded into the data warehouse.
2. The second stage is an OLAP (Online Analytical Processing) server, which can be implemented in more than one way. One method is to use a relational OLAP model that is an extension of RDBMS (relational database management system) technology. According to Ma, Chou and Yen (2000), citing Brown (1995), data warehouses have three basic technical characteristics, one of which is high-performance RDBMS technology. This means that the RDBMS that houses decision-making data should be capable of handling large amounts of data, analytic processing and complex queries.
The authors also briefly explain that OLAP is a set of functionalities that facilitate multidimensional analysis and the manipulation of aggregated data into various categories. In addition, Ma, Chou and Yen (2000), citing Inmon (1992), indicate several factors that accelerate the need for an OLAP environment in order to facilitate more efficient data access: "pre-aggregating data for better performance, pre-categorizing data for enhanced understanding and usability by end users and other derived data to ensure accuracy throughout the organization". Therefore, OLAP is an important component of the data warehouse architecture.
3. The third stage is a client, which contains querying, reporting and analysis tools. According to Ma, Chou and Yen (2000), data warehouses are primarily used for the presentation of standard reports and graphs. That is, the warehouse allows data coming from different transaction systems to be consolidated and used in reporting.
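The three stages above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the warehouse server, and the table name `sales_fact`, the sample rows and the helper functions are all hypothetical; a real warehouse would use dedicated ETL and OLAP tooling.

```python
import sqlite3

# Hypothetical rows "extracted" from two operational sources (illustrative data only).
store_sales = [("2024-01-05", "north", "bread", "2.50"),
               ("2024-01-05", "north", "milk", "1.20")]
web_sales = [("2024-01-06", "SOUTH", "bread", "2.50"),
             ("2024-02-01", "south", "milk", "1.30")]


def transform(rows):
    """Stage 1 (transform): normalize region names and convert amounts to float."""
    return [(day, region.lower(), product, float(amount))
            for day, region, product, amount in rows]


def load(conn, rows):
    """Stage 1 (load): put the cleaned rows into a single warehouse fact table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(day TEXT, region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)


def sales_by_region(conn):
    """Stage 2: an OLAP-style roll-up that aggregates along the region dimension."""
    cur = conn.execute("SELECT region, ROUND(SUM(amount), 2) FROM sales_fact "
                       "GROUP BY region ORDER BY region")
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(store_sales + web_sales))

# Stage 3: a simple client-side report built from the aggregated data.
report = sales_by_region(conn)
for region, total in report:
    print(f"{region}: {total:.2f}")
```

The point of the sketch is the division of labor: data from different transaction systems is cleaned once on the way in, so that every later report queries one consistent table.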
2.3 Benefits of Data Warehouse
Most organizations can benefit commercially from a data warehouse because it provides tools for business executives to systematically organize, understand and use data for strategic decisions. According to Peter (2003), the core benefits of a data warehouse include strengthening disaster recovery plans with an additional data backup source and improving data completeness and quality. Data quality and completeness are important for making good business decisions. A data warehouse can therefore meet the crucial need to deliver the right information at the right time to the right person in support of the decision-making process.
In addition, a data warehouse can offer savings in billing processes, reduce the cost of reporting and decrease fraud losses. Exforsys Inc (2000) noted that a further benefit of a data warehouse is its capability to handle the server tasks associated with querying, which most transaction systems are not used for. With proper tools and well-trained users, therefore, most organizations can benefit from a data warehouse. A data warehouse also helps organizations manage their corporate knowledge much more effectively.
This is because it makes it possible to analyze data from multiple sources and to act on business decisions based on them. For example, an organization may have collected important data in 20 separate databases and then consolidated all of it in one centralized data repository. This enables faster decision making, because the information can be retrieved from one well-managed store and accessed remotely.
3.0 Data mining
Data mining has come a long way over the last few years. It emerged as a technology area in the early 1990s (Thuraisingham, 1999). According to Lee and Siau (2001), drawing on Fayyad et al. (1996), Frawley, Piatetsky and Shapiro (1991) and Indurkhya and Weiss (1998), data mining has since become a research area of increasing importance, and it continues to attract more and more attention in the business and scientific communities. As much research has been done to understand and explore this new technology, many researchers have defined the term according to their own understanding.
Ma, Chou and Yen (2000), citing Reeves (1995), define data mining from a business perspective as the process of scanning a large data set to glean information. The authors further explain that data mining is the process of applying artificial intelligence techniques (such as advanced modeling and rule induction) to a large data set in order to determine patterns in the data.
Data mining has also been defined as the process of extracting previously unknown but significant information from large databases and using it to make critical business decisions (Singh, 1998, as cited in Atkinson, 2001). Atkinson also cites the definition given by Pritchard (1998), according to which data mining examines more difficult questions, such as which existing customers are likely to be retained, what new products are worth stocking in a store or why customers might switch from one company to another. These two definitions give a clear picture of how data mining works, and they are easy to understand because they use layman's terms.
According to Thuraisingham (1999), data mining is the process of posing various queries and extracting useful information, trends and patterns, often previously unknown, from huge quantities of data, possibly stored in databases. For better understanding, the author also illustrates how data mining can be used. The examples, which she takes from Grupe and Owrang (1998), are:
- A supermarket analyzes the purchases made by various people and arranges the items on the shelves in such a way as to improve sales.
- By analyzing patient history and current medical conditions, physicians not only diagnose the medical conditions but also predict potential problems that could occur.
Source: Thuraisingham, Bhavani M. (1998). Data mining: technologies, techniques, tools and trends. Boca Raton : CRC Press.
3.3 Data Mining Tools
The tools used in data mining are straightforward, concise and easy-to-implement algorithms that model nonrandom relationships or patterns in large data sets. According to Gargano and Raggad (1999), these models can then be applied to new data in order to optimize, correlate, forecast or organize. The authors list a broad range of data mining tools, including artificial intelligence methods (Firebaugh, 1998), decision trees (Ginsberg, 1993), genetic algorithms (Mitchell, 1996; Goldberg, 1989) and genetic programming (Koza, 1992), rule induction methods (Michalski et al., 1983), artificial neural networks (Chester, 1993; Vemuri, 1988) and expert systems (Turban, 1990; Friederich and Gargano, 1989).
This article discusses only five of these data mining tools:
3.3.1 Expert systems
According to Gargano and Raggad (1999), an expert system consists of a set of facts or data, a logic-based deduction engine and a knowledge base of rules mined from experts; it produces new facts and rules based on previously collected knowledge and facts.
The authors also state that knowledge is represented by if/then rules. A model knowledge base may consist of these three rules:
1. if A then D;
2. if (B or D) then C;
3. if C then E.
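How a deduction engine might apply these three rules can be sketched with simple forward chaining. This is an illustrative assumption, not the mechanism described by Gargano and Raggad; the `RULES` encoding and the function `forward_chain` are hypothetical names introduced here.

```python
# Each rule is (alternative premises, conclusion). A rule fires when ANY of its
# premises is already known, which is enough to express the "or" in rule 2.
RULES = [
    ({"A"}, "D"),        # 1. if A then D
    ({"B", "D"}, "C"),   # 2. if (B or D) then C
    ({"C"}, "E"),        # 3. if C then E
]


def forward_chain(facts, rules=RULES):
    """Repeatedly fire rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises & known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain({"A"})
print(sorted(derived))  # ['A', 'C', 'D', 'E']
```

Starting from the single fact A, rule 1 yields D, rule 2 then yields C, and rule 3 yields E, so new facts are produced from earlier ones as the definition above describes.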
Although this tool is excellent at capturing knowledge, the authors point out that it is not robust. Expert systems are notoriously brittle and cannot easily cope with complexity, internal inconsistencies or lack of clarity in either the facts or the rules. Moreover, when additional rules are added to the knowledge base, unexpected results may arise even if the rules are reorganized.
3.3.2 Fuzzy expert systems
According to Gargano and Raggad (1999), in fuzzy expert systems a truth value can lie anywhere in the interval of real numbers from 0.0 to 1.0. In conventional expert systems, by contrast, information is crisp: it is either completely false (i.e., zero) or completely true (i.e., one). A fuzzy expert system can build propositions with fuzzy operators such as VERY, OR, AND, SOMEWHAT and NOT.
The authors further note that fuzzy expert systems remove the need for exact mathematical representations and are easier to debug. A fuzzy expert system is implemented in much the same way as a conventional expert system, but knowledge can be obtained from the experts more readily, which makes knowledge elicitation easier.
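The fuzzy operators named above can be sketched as small functions on truth values between 0.0 and 1.0. The source does not define them, so the definitions below are an assumption: they follow the commonly used Zadeh conventions (minimum for AND, maximum for OR, complement for NOT, squaring for VERY and square root for SOMEWHAT). The function names and sample values are hypothetical.

```python
import math


def f_and(a, b):
    """Fuzzy AND: the weaker of the two truth values."""
    return min(a, b)


def f_or(a, b):
    """Fuzzy OR: the stronger of the two truth values."""
    return max(a, b)


def f_not(a):
    """Fuzzy NOT: the complement of the truth value."""
    return 1.0 - a


def very(a):
    """Hedge VERY: squaring makes a partial truth stricter."""
    return a ** 2


def somewhat(a):
    """Hedge SOMEWHAT: the square root makes a partial truth more lenient."""
    return math.sqrt(a)


# Illustrative truth values for "the customer is loyal" and "the customer is young".
loyal, young = 0.5, 0.25
print(f_and(loyal, young), f_or(loyal, young), f_not(young))
print(very(loyal), somewhat(young))
```

With crisp values of exactly 0.0 and 1.0, these functions reduce to ordinary Boolean logic, which is why a fuzzy expert system can be seen as a generalization of a conventional one.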
3.3.3 Decision trees
According to Gargano and Raggad (1999), this tool is based on simple tree models in which, at each branch during tree growth, the data set is deliberately partitioned into different classes and subclasses. As noted by Lee and Siau (2001), specific decision tree methods include Classification and Regression Trees (CART) and Chi-squared Automatic Interaction Detection (CHAID); both techniques are used for the classification of a data set. They also state that, of the two, CART requires less data preparation than CHAID.
These two decision tree techniques provide a set of rules that can be applied to a new (unclassified) data set to predict which records will have a given outcome. Decision trees are easy to explain, simple to build and easy to understand.
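The partitioning step at a single branch can be sketched as follows. The sketch assumes a CART-style criterion (Gini impurity) and one numeric attribute; the data and the names `gini` and `best_split` are hypothetical, and a full tree would apply the same step recursively to each partition.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure partition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split at all is the baseline
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Illustrative records: customer age and whether the customer bought the product.
ages = [22, 25, 31, 38, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, bought)
print(threshold, impurity)  # 34.5 0.0
```

The resulting split reads directly as a rule ("if age <= 34.5 then no, otherwise yes"), which is what makes the rules produced by a decision tree easy to apply to new, unclassified records.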
3.3.4 Rule induction
According to Gargano and Raggad (1999), this tool "uses statistical discovery methods to develop rules which depend on the frequency of correlation, the rate of accuracy, and the accuracy of prediction". It can be understood as the process of mining association rules in a massive database, where a set of attribute values is mined from the associated data sets in the database.
The authors also state that a benefit of this tool is that it generates rules based directly on statistics from actual data. For example, if a buyer buys one brand of bread, there may be a measurable likelihood that he or she also buys another brand of bread in the same transaction.
3.3.5 Genetic algorithm and genetic programming
According to Lee and Siau (2001), a genetic algorithm works with a population of rules, each of which represents a possible solution to a problem; the population is initially created at random. Gargano and Raggad (1999) add that genetic programming evolves complex algorithmic structures (i.e., computer programs), whereas genetic algorithms evolve complex data structures.
Gargano and Raggad (1999) also describe the advantages of the genetic algorithm model: it is easy to explain, robust and intuitively appealing, it explores the search space broadly and it can readily handle multidimensional problems.
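The idea of a randomly created population that evolves toward better solutions can be sketched as below. This is a toy illustration under stated assumptions: candidate solutions are bit strings rather than rules, the fitness function simply counts 1-bits, and all names and parameter values are hypothetical.

```python
import random


def fitness(bits):
    """Toy objective: the more 1-bits, the better the candidate."""
    return sum(bits)


def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Evolve a random population; return (best initial fitness, best final candidate)."""
    rng = random.Random(seed)
    # The population is initially created at random.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(candidate) for candidate in population)

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # selection: keep the fitter half
        children = [population[0][:]]              # elitism: carry the best over unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)          # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = children

    return initial_best, max(population, key=fitness)


initial_best, best = evolve()
print(initial_best, fitness(best))
```

Because the best candidate of each generation is copied unchanged into the next, the best fitness can never decrease from one generation to the next in this sketch.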
3.4 Data Mining Techniques
Data mining techniques are numerous and are an important part of data mining. According to Lee and Siau (2001), they can be divided into six major groups: statistics, techniques for mining transactional/relational databases, artificial intelligence (AI) techniques, the decision tree approach, genetic algorithms and visualization. Singhal (2007), meanwhile, proposes four categories of data mining techniques:
1. Association Analysis
According to Singhal (2007), this technique involves the discovery of association rules showing attribute-value conditions that frequently occur together in a given set of data. It is often used for market basket or transaction data analysis. For example, the following rule says that if a customer is in the age group 20 to 29 years and has an income greater than 40K per year, then he or she is likely to buy a DVD player.
age(X, "20-29") & income(X, ">40K") => buys(X, "DVD player")
[support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness. A support of 2% means that 2% of all transactions under analysis satisfy the whole rule, that is, they involve a customer aged 20-29 with an income greater than 40K who bought a DVD player. A confidence of 60% means that, among all customers in the age group 20-29 with an income greater than 40K, 60% bought a DVD player.
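The two measures can be computed directly from a list of transactions. The sketch below uses ten made-up transactions chosen only to make the arithmetic easy to follow (so the figures differ from the 2% and 60% in the rule above), and the function name `support_confidence` is hypothetical.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)            # share of ALL transactions
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Illustrative data: each transaction is the set of attribute values it exhibits.
transactions = [
    {"age 20-29", "income >40K", "DVD player"},
    {"age 20-29", "income >40K", "DVD player"},
    {"age 20-29", "income >40K", "DVD player"},
    {"age 20-29", "income >40K"},
    {"age 20-29", "income >40K"},
    {"age 30-39", "income >40K", "DVD player"},
    {"age 30-39"},
    {"age 20-29"},
    {"age 40-49", "DVD player"},
    {"age 40-49"},
]

support, confidence = support_confidence(
    transactions, {"age 20-29", "income >40K"}, {"DVD player"})
print(support, confidence)  # 0.3 0.6
```

Here five transactions match the condition part of the rule and three of those also include a DVD player, which gives a support of 3/10 and a confidence of 3/5.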
2. Classification and Prediction
According to Singhal (2007), "classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends". For example, an insurance company can identify potentially expensive patients by analyzing various patient records.
Some of the basic techniques for data classification are decision tree induction, Bayesian classification and neural networks. These techniques find a set of models that describe the different classes of objects. The models can then be used to predict the class of an object whose class is unknown. The derived model can be represented as (IF-THEN) rules, decision trees or other formulae.
3. Clustering
Clustering involves grouping objects so that objects within a group are highly similar to one another but very different from objects in other groups. It is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity (Singhal, 2007).
In business, clustering can be used to identify customer groups based on their purchasing patterns. It can also be used to help classify documents on the web for information discovery.
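Grouping customers by purchasing pattern can be sketched with a very small clustering routine. The source does not name an algorithm, so the sketch assumes k-means on a single attribute; the spending figures, starting centers and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers; return (centers, clusters)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(cluster) / len(cluster) if cluster else centers[i]
                   for i, cluster in enumerate(clusters)]
    return centers, clusters


# Illustrative monthly spending of six customers.
monthly_spend = [12, 15, 18, 95, 102, 110]
centers, clusters = kmeans_1d(monthly_spend, centers=[0, 50])
print(clusters)  # [[12, 15, 18], [95, 102, 110]]
```

The two resulting groups are tight internally and far apart from each other, which is the high intraclass and low interclass similarity described above.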
4. Outlier Analysis
A database may contain data objects that do not conform to the general model or behavior of the data. These data objects are called outliers. Outliers are useful for applications such as fraud detection and network intrusion detection. According to Lee and Siau (2001), it is important to detect outliers in data mining in order to separate good data from bad data, and outlier detection is usually part of the data cleaning process.
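One simple way to flag such objects is the interquartile range (IQR) rule, sketched below. The choice of this rule, the multiplier of 1.5, the sample amounts and the function name `iqr_outliers` are assumptions made for illustration; the source does not prescribe a detection method.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [value for value in values if value < low or value > high]


# Illustrative card transactions; one amount is far from the usual behavior.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = iqr_outliers(amounts)
print(outliers)  # [500]
```

Whether a flagged value is bad data to be cleaned or a genuine signal, such as a fraudulent transaction, still has to be decided by the application.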
Data mining can benefit an organization's decision making and learning by discovering information in the data contained in its databases. It offers many benefits, such as the accurate identification of buying trends, the optimization of promotional programs and the precise definition of market segments (Ma, Chou and Yen, 2000).
This is because data mining tools can query the data to generate specific information that supports the decision-making process. With the right information, a better decision can be made, which lowers the risk attached to the consequences of that decision.
Ma, Chou and Yen (2000) also state that data mining benefits the organization in three further ways: it facilitates accurate data identification and analysis, which can increase the quality of decision making; its strong navigation, computation and synthesis capabilities make it possible to gain critical competitive advantages; and relevant information is obtained faster, so time is used more effectively. For example, an automobile sales company can analyze the buying patterns of people living in various locations and send them brochures for the cars they are likely to buy.
Nemati and Barko (2003), citing Whiting (2002), report that by intelligently targeting consumers who were more likely to buy, a retailer reduced catalog circulation by 50 percent in 2001 and still increased revenue per catalog by almost 20 percent. Besides saving costs, data mining can also increase user efficiency and productivity.
In a nutshell, we can conclude that data warehousing and data mining are essential tools for any organization that wants to manage its information strategically and use it extensively in order to compete with others. The technology has significant benefits, both in the organization and management of the data in the warehouse and in the results of data mining, which are used for various purposes in daily business, such as making better decisions.
Based on my reading, much research has been conducted to address data warehouse and data mining issues. Issues such as the proliferation of distributed data warehouses within the organization, the quality of data, incorrect and incomplete data, and the growth of electronic data in support of daily transactional business will lead to the creation of ever larger warehouses to hold those data. As a result, there is a need for better data warehouse architectures and data mining tools and techniques. To achieve this, we must examine the technologies used previously and currently, identify their limitations and try to devise new solutions that overcome them.
Moreover, because of the benefits that data mining brings, most organizations are rushing to implement data warehousing and data mining. It is, however, a lengthy and complicated process, so it is important that an organization study successful implementations of data warehousing and data mining in other organizations as a guideline. The best lesson is to learn from others' mistakes.