Process Of Extracting Interesting Computer Science Essay

Data mining is the process of extracting interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from large information repositories such as relational databases, data warehouses and XML repositories. Data mining is also known as one of the core processes of Knowledge Discovery in Databases (KDD). Many people take data mining as a synonym for KDD, while others treat it as the core step of the KDD process. The KDD process is shown in Figure 3.1 [137]. It usually consists of three stages. The first is called preprocessing, which is executed before data mining techniques are applied to the relevant data; it includes data cleaning, integration, selection and transformation.

Figure 3.1: KDD process

The main stage of KDD is the data mining process, in which different algorithms are applied to produce hidden knowledge. After that comes another stage called post-processing, which evaluates the mining result according to users' requirements and domain knowledge. The stages work as follows.

First the databases need to be cleaned and integrated. Since the data may come from different databases, which may contain inconsistencies and duplications, the data source must be cleaned by removing such noise or making compromises. Suppose two different databases use different words to refer to the same thing in their schemas; when the two sources are integrated, only one of the terms can be chosen, provided it is known that both denote the same thing. Real-world data also tend to be incomplete and noisy due to manual input mistakes. The integrated data can be stored in a database, a data warehouse or another repository. As not all the data in the database are related to the mining task, the second stage is to select task-related data from the integrated resources and transform them into a format that is ready to be mined. Suppose we want to find which items are often purchased together in a supermarket; the database that records the purchase history may contain customer ID, items bought, transaction time, prices, quantity of each item and so on, but for this specific task only the items bought are needed. After the relevant data are selected, the database to which the mining techniques are applied will be much smaller, and consequently the whole process will be more efficient.
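The cleaning and selection steps described above can be sketched in a few lines of Python. The record fields (customer_id, items, price) and the sample values are hypothetical, purely for illustration:

```python
# Hypothetical raw purchase records, as they might arrive from
# several source databases before cleaning.
raw_records = [
    {"customer_id": 1, "items": ["milk", "bread"], "price": 5.0},
    {"customer_id": 1, "items": ["milk", "bread"], "price": 5.0},  # duplicate
    {"customer_id": 2, "items": ["beer"], "price": None},          # incomplete
    {"customer_id": 3, "items": ["milk", "beer"], "price": 7.5},
]

def clean(records):
    """Data cleaning: drop incomplete records and exact duplicates."""
    seen, result = set(), []
    for r in records:
        key = (r["customer_id"], tuple(r["items"]), r["price"])
        if r["price"] is None or key in seen:
            continue
        seen.add(key)
        result.append(r)
    return result

def select_task_relevant(records):
    """Selection: keep only the attribute needed for basket mining."""
    return [r["items"] for r in records]

transactions = select_task_relevant(clean(raw_records))
# transactions is now a much smaller, task-ready dataset:
# [["milk", "bread"], ["milk", "beer"]]
```

The same idea scales up with a real database query language doing the cleaning and selection; the point is only that the mined dataset is a pruned, transformed view of the sources.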

Various data mining techniques are then applied to the data source, and different knowledge comes out as the mining result. That knowledge is evaluated against certain rules, such as domain knowledge or concepts. If, after the evaluation, the result does not satisfy the requirements or contradicts the domain knowledge, some stages have to be redone until the right results are obtained.


There are two classes of data mining: descriptive and predictive. Descriptive mining summarizes or characterizes general properties of the data in a repository, while predictive mining performs inference on the current data in order to make predictions based on historical data.

There are various types of data mining techniques, such as association rules, classification and clustering. Based on those techniques, web mining and sequential pattern mining are also well researched. The following sections review the different types of mining techniques.


Association rule mining, one of the most important and well-researched data mining techniques, was first introduced in [138]. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories.


Classification [137] builds a model that can classify a class of objects so as to predict the class or missing attribute value of future objects whose class may not be known. It is a two-step process. In the first step, a model is constructed from a collection of training data to describe the characteristics of a set of data classes or concepts. Since the data classes or concepts are predefined, this step is also known as supervised learning, i.e., the class to which each training sample belongs is provided. In the second step, the model is used to predict the classes of future objects or data. There are a handful of techniques for classification [137]. Classification by decision tree has been well researched and plenty of algorithms have been designed; Murthy did a comprehensive survey on decision tree induction [139]. Bayesian classification is another technique, which can be found in Duda and Hart [140]. Nearest neighbor methods are also discussed in many statistical texts on classification, such as Duda and Hart [141] and James [142]. Many other machine learning and neural network techniques are used to help construct classification models.
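As a minimal illustration of the two-step process, the sketch below uses a 1-nearest-neighbour classifier (one of the nearest neighbor methods mentioned above): the labelled training samples play the role of the model, and a new object receives the class of its closest training sample. The feature values and class labels are made up:

```python
import math

# Step 1 (supervised "training"): a collection of samples whose
# classes are predefined. Values are illustrative only.
training = [  # (feature vector, class label)
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((8.0, 9.0), "B"),
    ((9.0, 8.5), "B"),
]

def classify(x):
    """Step 2: predict the class of a new object as the class of its
    nearest training sample (Euclidean distance)."""
    _, label = min(training, key=lambda s: math.dist(s[0], x))
    return label

print(classify((1.1, 0.9)))  # prints A
```

Decision trees and Bayesian classifiers differ only in what the step-1 model looks like; the two-step structure of training then prediction is the same.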


While classification can be taken as a supervised learning process, clustering is a similar mining technique that is instead an unsupervised learning process. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [137], so that objects within the same cluster are similar to some extent while being dissimilar to objects in other clusters. In classification, the class to which each record belongs is predefined, while in clustering there are no predefined classes; objects are grouped together based on their similarities. Similarity between objects is defined by similarity functions; usually similarities are quantitatively specified as distances or other measures by the corresponding domain experts. Many clustering applications are found in market segmentation. By clustering their customers into different groups, business organizations can provide different personalized services to different segments of the market. For example, based on the expense, deposit and withdrawal patterns of its customers, a bank can cluster the market into different groups of people, and for each group provide different kinds of loans for houses or cars with different budget plans.

In this way the bank can provide a better service and also make sure that all the loans can be reclaimed. A comprehensive survey of current clustering techniques and algorithms is available in [143].


Based on the types of data to which they are applied, mining techniques can be classified into different categories. The following are some of those categories.


To date most data are stored in relational databases, and relational databases are one of the biggest sources of mining objects. As a relational database is a highly structured data repository, its data are described by a set of attributes and stored in tables. With well-developed database query languages, data mining on relational databases is not difficult; it mainly focuses on discovering patterns and trends. For example, by analyzing the expense patterns of customers, certain information can be provided to different business organizations.

3.3.2 Transactional database

A transactional database is a collection of transaction records, in most cases sales records. With the popularity of computers and e-commerce, massive transactional databases are now available. Data mining on transactional databases focuses on the mining of association rules, i.e., finding correlations between items in the transaction records.


Spatial databases usually contain not only traditional data but also location or geographic information about the corresponding data. Spatial association rules describe the relationship between one set of features and another set of features in a spatial database. The definitions of spatial association rules and their parameters [144] are identical to those of regular association rules [138]. Spatial association rules also take the form X → Y, where X and Y are sets of predicates, at least one of which must be a spatial predicate [144, 145]. Algorithms for mining spatial association rules are similar to ordinary association rule mining algorithms except for the handling of spatial data; the predicate generation and rule generation processes are based on Apriori. The details of an algorithm for mining spatial association rules are explained in [145]. A spatial association rule mining application, GeoMiner [146], has been developed to extract useful information from a geographical database.


Unlike traditional transaction data, each temporal data item has a corresponding time-related attribute associated with it. Temporal association rules can therefore be more useful and informative than basic association rules.

Algorithms for mining periodical patterns and episode sequential patterns were introduced in [147] and [148] respectively. Much of this research now forms a new area of data mining called sequential pattern mining, i.e., mining frequent sequential patterns in time series databases, which was initiated by Agrawal in [149].


As information on the web increases at a phenomenal speed and the web becomes ubiquitous, many researchers have turned to the field of mining web data, called web mining. Web mining is usually divided into three main categories: web usage mining, web structure mining and web content mining. Web usage mining concentrates on mining the access patterns of users, so that the structure of a web site can be modified based on navigation patterns. Different applications for mining web logs have been developed to find navigation patterns. Besides improving web site structure, web usage mining is also valuable for cross-marketing strategies, web advertisements and promotion campaigns. Web structure mining focuses on the analysis of structures and links in web documents. The basic idea is that pages that are linked together have some kind of relationship. Using those links, a typical structure mining task is to classify web documents into authoritative pages and hub pages. Authoritative pages present the original source of information, while hub pages link to those authoritative pages. Web content includes text, graphics, media, etc.; consequently web content mining includes text mining, multimedia mining and graphic mining.


This section introduces the association rule mining problem in detail. Different issues in Association Rule Mining (ARM) are elaborated together with classic algorithms.

To make it easier to compare the algorithms, they are all described over a transaction database of the kind used in crucial decision-making processes. Let I = {I1, I2, ..., Im} be a set of m distinct attributes (items), let T be a transaction that contains a set of items such that T ⊆ I, and let D be a database of different transaction records T. An association rule is an implication of the form X → Y, where X, Y ⊂ I are sets of items called itemsets and X ∩ Y = ∅. X is called the antecedent and Y the consequent; the rule means X implies Y.

There are two important basic measures for association rules: support (s) and confidence (c). Since the database is large and users are concerned only with frequently purchased items, thresholds of support and confidence are usually predefined by users to drop those rules that are not interesting or useful. The two thresholds are called minimal support and minimal confidence respectively; additional constraints on interesting rules can also be specified by users. The support s of an association rule X → Y is defined as the percentage/fraction of records that contain X ∪ Y out of the total number of records in the database, while the confidence c is the fraction of records that contain X ∪ Y out of the records that contain X. The count for each item is increased by one every time the item is encountered in a different transaction T in database D during the scanning process, which means the support count does not take the quantity of the item into account.
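A small sketch of the two measures over a hypothetical transaction database (the items and the example rule are made up):

```python
# Toy transaction database D: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    return support(x | y) / support(x)

# Example rule: {diaper} -> {beer}
s = support({"diaper"} | {"beer"})    # 3 of 5 transactions -> 0.6
c = confidence({"diaper"}, {"beer"})  # 3 of the 4 diaper transactions -> 0.75
```

With a minimal support of, say, 0.5 and minimal confidence of 0.7, this rule would survive both thresholds.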

Association rule mining is usually decomposed into two sub-problems: finding the frequent itemsets and generating rules from them. The first sub-problem can be further divided into two parts: the candidate itemset generation process and the frequent itemset generation process. Those itemsets whose support exceeds the support threshold are called large or frequent itemsets, while those itemsets that are expected or hoped to be large or frequent are called candidate itemsets. Most of the association rule mining algorithms surveyed are quite similar; the difference lies in the extent to which certain improvements have been made, so only some of the milestones among association rule mining algorithms are introduced.

This subsection introduces some naive and basic algorithms for association rule mining, the Apriori series of approaches, and uses a sample transaction database from a supermarket to explain how those algorithms work. This database records the purchasing attributes of its customers. Suppose that during preprocessing all the attributes that are not relevant or useful to the mining task are pruned, so that only the useful attributes are left, ready for mining.


AIS Algorithm

The AIS (Agrawal, Imielinski, Swami) algorithm was the first algorithm proposed for mining association rules [138]. It focuses on improving the quality of databases together with the necessary functionality to process decision support queries. In this algorithm only one-item-consequent association rules are generated, which means that the consequent of those rules contains only one item.

For example, based on transaction T100 = {I1, I2, I5}, candidate 2-itemsets are generated according to this specific order by extending I1 with only I2 and I5, and similarly extending I2 with I5. The result is shown in Table I(d). During the second pass over the database, the support counts of those candidate 2-itemsets are accumulated and checked against the support threshold. Similarly, candidate (k+1)-itemsets are generated by extending frequent k-itemsets with items in the same transaction. The candidate itemset generation and frequent itemset generation processes iterate until one of them becomes empty. The resulting frequent itemsets include only one large 3-itemset, {I1, I2, I5}.

To make this algorithm more efficient, an estimation method was introduced to prune those candidate itemsets that have no hope of being large, so that the unnecessary effort of counting them can be avoided. Since all the candidate itemsets and frequent itemsets are assumed to be stored in main memory, memory management is also proposed for AIS for when memory is not enough. One approach is to delete candidate itemsets that have never been extended. Another is to delete candidate itemsets that have the maximal number of items together with their siblings, and to store the parent itemset on disk as a seed for the next pass. Detailed examples are available in [138]. The drawback of the AIS algorithm is that too many candidate itemsets that finally turn out to be small are generated, which requires more space and wastes effort that turns out to be useless. At the same time the algorithm requires too many passes over the whole database.

Apriori Algorithm

Apriori is a great improvement in the history of association rule mining; the Apriori algorithm was first proposed by Agrawal in [149]. AIS is just a straightforward approach that requires many passes over the database, generating many candidate itemsets and storing the counters of each candidate while most of them turn out not to be frequent. Apriori is more efficient during the candidate generation process for two reasons: it employs a different candidate generation method and a new pruning technique. The GenerateRules function is elaborated in [149].
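Under the standard formulation from the literature, the two ideas named above, join-based candidate generation and subset-based pruning, can be sketched as follows; the sample transactions and the support threshold are made up:

```python
from itertools import combinations

# Toy transaction database for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
]

def apriori(db, min_support=0.6):
    n = len(db)
    # L1: frequent 1-itemsets.
    items = {i for t in db for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in db) / n >= min_support}
    all_frequent = set(freq)
    k = 1
    while freq:
        # Join step: candidates C(k+1) come only from unions of frequent
        # k-itemsets, never from raw transactions as in AIS.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count supports in one pass and keep the frequent candidates.
        freq = {c for c in candidates
                if sum(c <= t for t in db) / n >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent
```

On this data, for instance, the candidate {bread, diaper, beer} is pruned without counting because its subset {bread, beer} is not frequent.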

The Apriori algorithm still inherits the drawback of scanning the whole database many times. Based on Apriori, many new algorithms were designed with modifications or improvements. Generally there were two approaches: one is to reduce the number of passes over the whole database, or to replace the whole database with only part of it based on the current frequent itemsets; the other is to explore different kinds of pruning techniques that make the number of candidate itemsets much smaller. Apriori-TID and Apriori-Hybrid [149], DHP [150] and SON [151] are modifications of the Apriori algorithm.

Most of the algorithms introduced above are based on the Apriori algorithm and try to improve efficiency by making some modifications, such as reducing the number of passes over the database, reducing the size of the database to be scanned in each pass, pruning the candidates with different techniques, and using sampling. However, there are two bottlenecks in the Apriori algorithm. One is the complex candidate generation process, which uses most of the time, space and memory. The other is the multiple scans of the database.

3.4.2 Frequent Pattern Tree (FP-Tree) Algorithm

To break the two bottlenecks of the Apriori series of algorithms, some association rule mining methods using tree structures have been designed. FP-Tree [137], for frequent pattern mining, is another milestone in the development of association rule mining, which breaks the two bottlenecks of Apriori. The frequent itemsets are generated with only two passes over the database and without any candidate generation process. FP-Tree was introduced by Han et al. in [132]. By avoiding the candidate generation process and making fewer passes over the database, FP-Tree is an order of magnitude faster than the Apriori algorithm. The frequent pattern generation process includes two sub-processes: constructing the FP-Tree and generating the frequent patterns from it. The efficiency of the FP-Tree algorithm is due to three reasons. First, the FP-Tree is a compressed representation of the original database, because only the frequent items are used to construct the tree and other, irrelevant information is pruned; also, by ordering the items according to their supports, overlapping parts appear only once, with different support counts. Second, the algorithm scans the database only twice. The frequent patterns are generated by the FP-growth procedure: by constructing the conditional FP-Tree, which contains the patterns with a specified suffix pattern, frequent patterns can be easily generated, and the computation cost decreases dramatically. Third, FP-Tree uses a divide-and-conquer method that considerably reduces the size of the subsequent conditional FP-Trees; longer frequent patterns are generated by adding suffixes to shorter frequent patterns. Examples in [136] illustrate the details of this mining process.
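A minimal sketch of the FP-Tree construction step, assuming the usual formulation (pass 1 counts item supports; pass 2 inserts each transaction with its frequent items reordered by descending support, so that common prefixes are shared). The FP-growth mining step is omitted, and the sample transactions are made up:

```python
from collections import Counter

class Node:
    """One FP-Tree node: an item, its count, and its children by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(db, min_count=2):
    # Pass 1 over the database: global item support counts.
    counts = Counter(i for t in db for i in t)
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = Node(None)
    # Pass 2: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken by name) so that
    # transactions with common prefixes share tree paths.
    for t in db:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for i in items:
            node = node.children.setdefault(i, Node(i))
            node.count += 1
    return root, counts

db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
root, counts = build_fp_tree(db)
```

Here the three transactions containing "a" share the same branch below the root, which is exactly the compression the first reason above describes.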

Every algorithm has its limitations; for FP-Tree, it is difficult to use in an interactive mining system. During an interactive mining process, users may change the support threshold according to the rules, but for FP-Tree a change of support may lead to a repetition of the whole mining process. Another limitation is that FP-Tree is not suitable for incremental mining. Since databases keep changing over time and new datasets may be inserted, those insertions may also lead to a repetition of the whole process if the FP-Tree algorithm is employed.

3.4.3 Rapid Association Rule Mining (RARM)

RARM [152] is another association rule mining method that uses a tree structure to represent the original database and avoids the candidate generation process. RARM is claimed to be much faster than the FP-Tree algorithm.

During preprocessing the database is scanned to construct the tree structure; the process is similar to that of generating the FP-Tree. For each transaction, all the possible itemset combinations are extracted, and the counts of those that are already in the tree are updated accordingly.

An interactive application, in which users can modify the support and confidence thresholds according to the resulting rules, was also proposed in [148]. The methods in [148] are mainly concerned with intra-concept-level association rules, while mining cross-level rules, in which the antecedents and consequents of the rules belong to different concept levels, was also introduced in [148]. For multiple-level association rule mining there is usually more than one possible way to classify all the items into a concept hierarchy, and different users may prefer different hierarchies. Multiple concept levels can provide information according to different requirements in different fields. However, mining at multiple concept levels may produce many redundant rules: some rules may carry the same information, or knowledge contained in one rule may also be contained in other rules. A further selection among the resulting rules is required to provide users with useful, concise knowledge.

3.4.4 Multiple Dimensional ARM

Most of the algorithms and techniques discussed above are only concerned with association rules over a single attribute and Boolean data: all the rules are about the same attribute, and the values can only be yes/1 or no/0. Mining multiple-dimensional association rules can generate rules such as age(X, "20..29") ∧ occupation(X, "student") ⇒ buys(X, "laptop"), rather than single-dimension rules such as buys(X, "diaper") ⇒ buys(X, "beer"). Multiple-dimensional association rule mining discovers correlations between different predicates/attributes. Each attribute/predicate, such as age, occupation or buys, is called a dimension. At the same time, multiple-dimensional association rule mining handles all types of data, such as Boolean, categorical and numerical data [143]. The mining process is similar to that of multiple-level association rule mining. First the frequent 1-dimension sets are generated, then all frequent dimension sets are generated based on the Apriori algorithm. The mining process is straightforward, but there are three basic approaches for generating the multiple dimensions.
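The process described above can be sketched by treating each (dimension, value) pair, e.g. ("age", "20..29"), as a single item, so that ordinary frequent itemset counting yields inter-dimension combinations. The records and the minimum count are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Hypothetical multi-dimensional records: each dimension is an
# attribute with a categorical (or discretized numerical) value.
records = [
    {"age": "20..29", "occupation": "student", "buys": "laptop"},
    {"age": "20..29", "occupation": "student", "buys": "laptop"},
    {"age": "30..39", "occupation": "teacher", "buys": "car"},
    {"age": "20..29", "occupation": "student", "buys": "coffee"},
]

def frequent_dimension_sets(db, k, min_count=2):
    """Count k-combinations of (dimension, value) predicates and keep
    those reaching min_count, i.e. the frequent k-dimension sets."""
    counts = Counter()
    for r in db:
        predicates = sorted(r.items())  # canonical order for stable keys
        for combo in combinations(predicates, k):
            counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= min_count}

pairs = frequent_dimension_sets(records, k=2)
# e.g. (("age", "20..29"), ("occupation", "student")) occurs 3 times,
# supporting a rule like age(X, "20..29") => occupation(X, "student").
```

From the frequent dimension sets, rules between different predicates are then generated just as in single-dimension ARM.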