Knowledge Discovery In Database Computer Science Essay



Keywords: data mining, data warehouse, knowledge discovery in databases, classification, association rules, neural networks, genetic algorithms, decision trees, nearest neighbor

There have been immense advances in data collection and storage technologies, which have made it possible for many organizations to keep vast amounts of essential data related to their activities. In many cases, because records are missing or because the data are qualitative rather than quantitative, the data cannot be analyzed by standard statistical methods. The overwhelming growth of data and databases in organizations has created the need for new methodologies and tools so that organizations can serve their customers better. This is why data mining approaches have been adopted by many organizations in both the public and private sectors. Data mining can be performed on a single data source or across numerous data sources. Organizations can use several data mining methods, techniques and tools to analyze data sources in order to discover new and hidden patterns, trends and relationships in the data. Data mining has a highly complementary relationship with data warehousing. It is very important for the analyst or practitioner to understand deeply the interactions and relationships between business practices and the organization, as this is key to adopting flexible data mining techniques that solve organization-specific problems.

All organizations collect data on their business performance to provide a basis for competitive advantage, which allows them to analyze their most important asset, their data, efficiently. The process of analyzing this valuable data can be considered data mining. It is beneficial for organizations to have a way to search large databases for the important information they may contain. Data mining is one of the steps in the knowledge discovery in databases (KDD) process, which deals with discovering hidden patterns using various data mining techniques. According to Michael L. Gargano and Bel G. Raggad, discovering new and valuable information is a major function of data mining. The newly discovered patterns and trends help the analyst or practitioner learn more from this valuable information. [1] These approaches can also give the analyst a clearer view of what is going on in the data. Figure 1 shows how the data mining process extracts data and turns it into useful information.

Figure 1: The data mining process

Data mining can be viewed as a process of solving a puzzle. Each piece of the puzzle is a very simple structure. However, once all the pieces are successfully joined, they can form a very sophisticated and detailed system. The process of joining the pieces may be frustrating, but once we know how to join them, we realize that the process is not really that tough. The same applies to data mining. In practice, we often do not know which data sources in our organizations are valuable. But once we know, we find that they help us analyze, predict and discover patterns, trends and relationships in the data.

Many organizations with a strong consumer focus, such as those in the retail, marketing, communication and financial industries, commonly apply data mining to their business operations. How much data mining helps them improve their services to customers and increase company profits depends on how well the techniques they apply meet the organization's goals and objectives.

Rapid advances in data capture, data transmission and storage capabilities encourage most organizations to consolidate their databases into a data warehouse. With data warehousing, an organization's current and historical data are merged into a single repository. Basically, a data warehouse is created as a repository for informational data and is a useful tool for supporting management decision making because it is subject oriented (all related data are connected together), integrated, time variant (changes to the data are tracked over time) and nonvolatile (data is always added and never deleted). The data stored in the data warehouse is collected from different departments across the organization, which hold the operational data representing the organization's day-to-day needs, and from external data sources. The data warehouse acts as a centralized store of data that helps users make the most of their access and analysis activities. Figure 2 shows the contents of a data warehouse across the organization.

Figure 2: Contents of a data warehouse across the organization

The complexity and volume of data in the data warehouse have created a pressing need for new capabilities to identify trends and relationships in the data, because simple query and reporting tools are not sufficient. Thus, data mining plays its role in discovering patterns, trends and relationships in the data warehouse in a more sophisticated and more productive way (Pass 1997a). Data warehousing in turn improves the success of data mining. A data warehouse may contain integrated data, detailed and summarized data, historical data and metadata, all of which are crucial to improving data mining quality (Inmon, 1997). [2]

For example, integrated data that has already been cleansed helps improve the data mining process. Once the data has been loaded into the data warehouse, the analyst can focus on mining the data into useful information rather than having to cleanse and integrate it. The analyst can also analyze data over a wide range, as the data warehouse contains a variety of detailed data, and data is continually added and never removed. This enables the analyst to see the nature of the business as a whole. Figure 3 shows an overview of the data warehouse flow.

Figure 3: Overview of the data warehouse flow
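As a rough illustration of the flow described above, the following Python sketch cleanses operational records and appends them to a warehouse that is never deleted from. The record fields (`customer`, `amount`), the function names and the in-memory list standing in for the warehouse are all assumptions made for this example, not part of any particular warehouse product.

```python
from datetime import date

def cleanse(record):
    """Normalize one operational record; return None if it is unusable."""
    name = record.get("customer", "").strip().title()
    amount = record.get("amount")
    if not name or amount is None:
        return None  # drop records with a missing customer or amount
    return {"customer": name, "amount": float(amount)}

def load(warehouse, source_name, records, load_date):
    """Append cleansed records tagged with their source and load date.

    Rows are only ever appended (nonvolatile), and the load date keeps
    the history of when each row arrived (time variant).
    """
    for record in records:
        clean = cleanse(record)
        if clean is not None:
            warehouse.append({**clean, "source": source_name, "loaded_on": load_date})
    return warehouse

# Hypothetical operational data from two departments.
warehouse = []
load(warehouse, "sales", [{"customer": " alice ", "amount": "10.5"},
                          {"customer": "", "amount": 3}], date(2024, 1, 1))
load(warehouse, "support", [{"customer": "BOB", "amount": 7}], date(2024, 1, 2))
```

After both loads, the warehouse holds two rows (the record with an empty customer name is rejected during cleansing), and an analyst can query them without repeating the cleansing work.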

Some of the most widely used data mining techniques, such as the nearest neighbor method, decision trees, association rules, neural networks and genetic algorithms, have been adopted by many companies. All these methods operate with the user largely outside the analysis loop. In addition, these techniques use sets of rules and algorithms that are computed automatically over an entire data set.

A decision tree can be defined as an analytical tool used to generate rules and relationships by systematically breaking down and subdividing the information in a data set. Classification and Regression Trees (CART) and Chi-squared Automatic Interaction Detection (CHAID) are examples of specific decision tree methods. These two techniques are used for classification of a data set. [3] Classification is one of the common tasks of data mining; it maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before the data is examined.

Decision trees can be used to solve problems that involve broad categorical classification as well as prediction. However, the technique is not recommended for making specific predictions of the values of quantitative variables. Decision trees are very helpful for splitting the values of a variable into a small number of groups. Figure 4 shows an example of the classification process using the decision tree technique.

Figure 4: Example of classification using a decision tree
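The way a decision tree subdivides a data set can be sketched with the splitting criterion that CART-style trees commonly use, Gini impurity. The code below is a minimal sketch of choosing one split on one numeric variable, not a full tree builder; the income and loan-approval data are invented for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    n = len(values)
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: customer income (in thousands) and a predefined class.
incomes = [20, 25, 30, 45, 50, 60]
approved = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_threshold(incomes, approved)
```

On this toy data the best split is `income <= 30`, which separates the two classes completely (weighted impurity 0.0). A full tree repeats the same search recursively on each side of the split.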


Association rules are also known as market basket analysis. [4] This data mining technique helps an organization discover correlations or co-occurrences among transactional events. The association rules technique is most useful in analyses that search for interesting trends and relationships within a data set.

For example, Giant Hypermarket is trying to decide whether to put bread on sale. To determine which other products are frequently purchased with bread, the Giant Hypermarket analyst generates association rules. The analyst discovers that almost 70% of the time that bread is sold, margarine or jelly is sold as well. As a result, the analyst attempts to capitalize on the association between these products by placing them together. In addition, the analyst might reduce the price of bread while raising the price of margarine or jelly by 2 cents. In this case, there is a high possibility that customers will buy the margarine or jelly without checking its price, since the increase is very small. This scenario shows how data mining helps Giant Hypermarket increase its revenue. It also helps the analyst discover patterns and trends in how customers choose and buy products, which can be helpful in predicting future buying trends and forecasting supply and demand.
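The strength of a rule such as "customers who buy bread also buy margarine" is usually measured with two quantities, support and confidence. The following sketch computes both over a handful of invented baskets; the transactions are hypothetical and are not the Giant Hypermarket figures quoted above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of products per transaction.
baskets = [
    {"bread", "margarine"},
    {"bread", "jelly"},
    {"bread", "margarine", "milk"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

bread_support = support(baskets, {"bread"})                       # 4 of 5 baskets
bread_to_margarine = confidence(baskets, {"bread"}, {"margarine"})  # 2 of the 4 bread baskets
```

Here bread appears in 80% of baskets, and half of the baskets containing bread also contain margarine. An analyst would typically keep only the rules whose support and confidence exceed chosen thresholds.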

A neural network is a computational methodology used for pattern identification and classification. It is often referred to as an artificial neural network (ANN). A neural network is an information processing system that consists of a graph representing the processing system and various algorithms that access the graph. It is arranged as a directed graph with many nodes, which are the processing elements, and arcs, which are the interconnections between them. The nodes of a neural network consist of source (input), sink (output) and internal (hidden) nodes. [5] This data mining technique is useful for finding original ways of segmenting a data set. It can also be applied to discover subgroups of the data that share common features separating them from the rest of the population. Figure 5 shows a sample multilayered artificial neural network.

Figure 5: Sample multilayered artificial neural network
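The directed-graph structure of input, hidden and output nodes can be illustrated with a single forward pass through a small network. This is a minimal sketch only: the sigmoid activation, the layer sizes and the weight values are assumptions chosen for the example, and no training step is shown.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer of nodes: each node sums its weighted incoming arcs, adds a bias, applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate values from source (input) nodes through hidden nodes to sink (output) nodes."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical network: 2 input nodes, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]  # one row of incoming-arc weights per hidden node
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

prediction = forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
```

The result is a list with one value between 0 and 1, which could be read as a class score. In practice the weights are learned from data rather than set by hand.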


A genetic algorithm is an optimization technique used to find an optimal solution to a problem. Genetic algorithms typically use a survival-of-the-fittest strategy to learn the coefficients of a linear discriminant function. [6] The technique involves three basic mechanisms by which information is chosen, altered and passed on in order to achieve optimization. They are:


Selection is based on the principle of survival of the fittest, in which the individuals best suited to the environment are the ones that survive to pass their genetic material on to the next generation. This process is done randomly.


Crossover occurs when two individuals (parents) from the population generate a new individual (offspring or child) by switching subsequences of their strings. [7] When crossover takes place, new offspring are created by combining or matching subsets of the information contained in the vectors of both parents.


The last process used in genetic algorithms is mutation. Mutation randomly modifies the genetic structure of each new offspring. Further, mutation can widen the search space to include solutions that might not appear originally in the data set.

Using this data mining technique helps maximize profit and switch ability by searching through combinations of product features. The process may run hundreds of times until a sufficiently good solution is found. [8] Figure 6 shows an overview of the genetic algorithm process.

Figure 6: Overview of the genetic algorithm process
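The three mechanisms of selection, crossover and mutation can be sketched on a toy problem. The objective used here (maximizing the number of 1-bits in a string), the tournament form of selection and all parameter values are assumptions made for illustration; they are not taken from the sources cited above.

```python
import random

def fitness(bits):
    """Toy objective (an assumption): count the 1-bits, so the optimum is all ones."""
    return sum(bits)

def select(population, rng):
    """Tournament selection: draw two individuals at random and keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(mother, father, rng):
    """Single-point crossover: the child takes a prefix from one parent and the rest from the other."""
    point = rng.randrange(1, len(mother))
    return mother[:point] + father[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(length=12, pop_size=20, generations=40, rate=0.02, seed=0):
    """Run the select / crossover / mutate loop and return the fittest final individual."""
    rng = random.Random(seed)  # fixed seed so repeated runs behave the same
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        population = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rate, rng)
            for _ in range(pop_size)
        ]
    return max(population, key=fitness)

best = evolve()
```

Each generation replaces the whole population with mutated offspring of selected parents, so fitter bit patterns tend to spread over time. A real application would replace `fitness` with a measure such as the profit of a product-feature combination.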


The nearest neighbor method is sometimes called k-nearest neighbor. This technique is a common classification scheme in which a record is classified based on the classes of the k records closest to it in a historical data set. [9]
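A minimal sketch of k-nearest neighbor classification follows. It assumes numeric features, Euclidean distance and a majority vote among the k closest historical records; the records and the "low" and "high" labels are invented for the example.

```python
import math
from collections import Counter

def knn_classify(records, query, k=3):
    """Classify `query` by majority vote among the k historical records nearest to it.

    `records` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(records, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical data: (feature vector, class label).
history = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
    ((8.5, 9.0), "high"),
]

label_a = knn_classify(history, (1.2, 1.5))
label_b = knn_classify(history, (8.0, 8.5))
```

The first query lies among the "low" records and the second among the "high" records, so they receive those labels. The choice of k and of the distance measure both affect the result and are usually tuned to the data.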

Data mining offers several advantages to a variety of industries. Below are some examples of areas where data mining has been successfully adopted.

Retail and marketing industries - Improved marketing campaigns

The benefit of using data mining in marketing campaign design can be seen in the improved response to campaigns. Marketing organizations can reduce the cost of designing marketing campaigns by using data mining techniques.

For example, Company A actively promotes its products by sending flyers, mailings and letters to its customers. This effort can cost a considerable amount of money in copying, design, printing, postage and handling. It is essential to have accurate mailing addresses for these marketing packages, as this helps the company predict or analyze customer response. For this reason, data mining plays an important role in identifying patterns in the company's historical data, so that Company A can avoid wasting time, resources and money.

Health Care Industry

Applying data mining techniques may help minimize the cost of health treatment and avoid duplicated examinations and wasted time. Data mining can also identify potential fraud and abuse in insurance claims processing and verification. In addition, health care analysts can use data mining for patient profiling, where it can uncover patients' health and lifestyle histories, as these can affect medical coverage and service utilization.

Education Industry - Effective educational process

Data mining techniques help institutions enhance enrollment management and maintain their student numbers.

For example, institutions may use historical data on their students to identify the patterns and trends of successful students and of school leavers. Applying data mining techniques to this data helps institutions predict which course is suitable for a student to enroll in, based on the student's capabilities, qualifications and interests. Suitable patterns and trends obtained from data mining can save time, reduce pressure on staff and improve the delivery of educational services.

Banking Industry

Data mining is also very helpful in the banking industry. It helps banking analysts identify patterns and trends in fraudulent credit card usage. With those patterns and trends, they can make predictions about credit card users, such as how likely a user is to change credit card affiliation and how much a customer will spend. Data mining techniques can also identify loyal customers, who are most important to keeping every bank's business profitable. Further, by using data mining techniques, bank analysts can find hidden correlations between various financial competitors.

As mentioned above, a data warehouse is designed to meet strategic information needs for making long-term decisions, which are often based on trend analysis of historical information and data. A data warehouse offers organizations several advantages, including the following:

A data warehouse provides information for determining the most effective promotion media for a particular group. This applies especially to the marketing and sales industries.

A data warehouse also provides information on customers who are likely to stop using the organization's services.

A data warehouse provides information for detecting sales trends by product line and customer class.

All the interesting and important data related to the data sources can be found in the data models contained in the data warehouse. This helps analysts identify and analyze multiple data models to retrieve valuable information.

A data warehouse enables an organization to migrate legacy systems to a new environment and to improve data quality.

A data warehouse helps organizations deal with a dynamic business environment, because it is a widely useful resource for consolidating corporate information and sharing it across the organization to support top management's decision making.

The data warehouse is an enabler that gives organizations access to the wealth contained in their information assets.

The integration of data mining techniques and data warehouses into normal day-to-day business activities has become commonplace. Data mining applications can derive a great deal of demographic information about customers that was previously unknown or hidden in the data. Recently, interest in data mining has grown in many industries, such as marketing, sales, banking and retail. To keep a business profitable, it is essential for every practitioner or analyst to understand what is of interest and importance to the organization. This is the major step before beginning to develop a data mining system. In addition, a good practitioner should know what types and classes of patterns might be identified before the first record is processed. The major function of a data warehouse is to support long-term strategic decisions. Data are extracted from operational systems but must be reformatted, cleansed, integrated and summarized before being placed into the data warehouse. Data in the data warehouse is always added and never deleted, which helps management see a clear picture of the nature of their business. Data mining and data warehousing are closely related, but data warehousing is not a prerequisite for a data mining solution. [10]