Dynamic Data Mining


A few studies have addressed dynamic data mining [2, 16, 18, 20, 22, 25, 33, 34, 35, 36, 39, 40, 56, 57], whereas many more have targeted static data mining [9, 17, 21, 43, 44, 45, 48, 58, 59, 60, 61, 62, 63, 64, 65]. This dissertation is closely related to dynamic data mining and, more specifically, to association rules. Section 2.1 gives a brief survey of association rule algorithms and introduces the basic concepts behind their different types. Section 2.2 discusses various static data mining algorithms, Section 2.3 introduces dynamic data mining algorithms, and Section 2.4 discusses dynamic association rules.

2.1 Association Rules

As dynamic data mining has recently become an important research domain, much work has been done on classification [26, 29, 66], regression analysis [67] and clustering [68]. In this dissertation, we focus on association rules.

Han et al [65] proposed a novel frequent-pattern tree (FP-tree) structure, an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and developed an efficient FP-tree-based mining method. Constructing the FP-tree requires two full I/O scans: the first scan generates the frequent 1-itemsets; in the second scan, non-frequent items are stripped from the transactions and the remaining sub-transactions are ordered by the support of their items. These sorted sub-transactions form the paths of the tree, and sub-transactions that share the same prefix share the same portion of the path starting from the root. The FP-tree also has a header table containing the frequent items; this table holds a header link for each item in the FP-tree and connects the nodes of the same item to facilitate item traversal during the mining process. However, the recursive method used to mine the FP-tree requires significant memory, and large databases quickly blow out the memory stack.
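As a rough illustration of the two-pass construction described above (not Han et al's actual implementation; the function name and dictionary-based node representation are our own), the FP-tree build can be sketched in Python as follows:

```python
from collections import defaultdict

def build_fp_tree(transactions, min_sup):
    """Sketch of two-pass FP-tree construction.

    Pass 1: count item supports and keep the frequent 1-itemsets.
    Pass 2: strip infrequent items, sort the rest by descending support,
    and insert each sub-transaction as a path into a prefix tree, so
    sub-transactions with a shared prefix share the same path.
    """
    # Pass 1: frequent 1-itemsets
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_sup}

    # Tree nodes are plain dicts; the header table links each item
    # to all of its nodes for traversal during mining.
    root = {"item": None, "count": 0, "children": {}}
    header = defaultdict(list)

    # Pass 2: insert the ordered sub-transactions
    for t in transactions:
        sub = sorted((i for i in t if i in frequent),
                     key=lambda i: (-frequent[i], i))
        node = root
        for item in sub:
            child = node["children"].get(item)
            if child is None:
                child = {"item": item, "count": 0, "children": {}}
                node["children"][item] = child
                header[item].append(child)
            child["count"] += 1
            node = child
    return root, header, frequent
```

For example, with transactions {a,b}, {b,c,d} and {a,b,c} and a minimum support of 2, item d is stripped and all three paths share the prefix node for b.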

Raghavan et al [48] considered a large database of customer transactions, where each transaction consists of the items purchased by a customer in one visit. They presented an efficient algorithm that generates all significant association rules between items in the database; the algorithm incorporates buffer management and novel estimation and pruning techniques. They also presented results from applying the algorithm to sales data obtained from a large retailing company.

Nayak et al [43] developed an association rule algorithm called ~AR that accepts partial support from the data. By generating these "approximate" rules, data can contribute to the discovery despite the presence of noisy or missing values. The algorithm is built upon Apriori and uses two main steps to handle missing and noisy data: first, missing values are replaced with a probability distribution over the possible values represented by the existing data; second, all data contribute probabilistically to the candidate patterns. Patterns that receive a sufficient amount of full or partial support are kept and expanded.
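The idea of partial support can be illustrated with a minimal sketch (our own simplification, not the ~AR algorithm itself): a record whose attribute value is missing contributes the estimated probability of the target value, as inferred from the observed records, rather than 0 or 1.

```python
from collections import Counter

def partial_support(records, attr, target_value):
    """Sketch of probabilistic support counting: records with a missing
    attribute (None) contribute fractionally, according to the empirical
    distribution of that attribute over the observed records."""
    observed = [r[attr] for r in records if r[attr] is not None]
    dist = Counter(observed)
    total = len(observed)
    support = 0.0
    for r in records:
        v = r[attr]
        if v is None:
            support += dist[target_value] / total  # fractional contribution
        elif v == target_value:
            support += 1.0
    return support
```

With two "red" records, one "blue" record and one missing value, the support of "red" would be 2 + 2/3 rather than a hard 2 or 3.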

Borgelt and Kruse [44] described an implementation of the well-known Apriori algorithm for the induction of association rules that is based on the concept of a prefix tree. Their implementation organizes the counters for the itemsets in a special kind of prefix tree, which allows them to be stored efficiently (using little memory) and supports both processing the transactions and generating the rules. Each nS denotes a counter for an itemset S, and the edge labels on a path from the root to some node specify the common part of the itemsets for which that node holds counters. The structure deals with sets, not sequences: abcd, for instance, is the same as badc, and therefore only one counter is needed for them. During the first step of the Apriori algorithm this tree is created level by level; in a second traversal the two-element sets are checked, and so on. In doing so, some branches of the tree can of course be pruned.
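The set-not-sequence point can be made concrete with a small sketch (ours, not Borgelt and Kruse's code): each candidate itemset is reduced to its canonically sorted tuple, so abcd and badc hit the same counter; the dict key here plays the role of the root-to-node path in their prefix tree.

```python
from itertools import combinations

def count_candidates(transactions, k):
    """Sketch of one Apriori counting level: count all k-itemsets per
    transaction, keyed by their canonically sorted tuple so that
    permutations of the same set share a single counter."""
    counters = {}
    for t in transactions:
        for cand in combinations(sorted(set(t)), k):
            counters[cand] = counters.get(cand, 0) + 1
    return counters
```

A real prefix-tree implementation would additionally share the storage of common prefixes and prune branches that cannot become frequent; the flat dict above only captures the canonical-form idea.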

ElHajj et al [45] proposed a new efficient algorithm for finding frequent patterns, called COFI*, based on COFI-trees [1] and FP-trees [65]. The algorithm generates the frequent patterns by efficiently finding the set of local maximals for each COFI-tree and generating all of their subsets, which are indeed frequent; counting the support of each of these frequent patterns is the last major step in the approach. The main differences between COFI* and COFI are the following: (1) COFI* uses a novel technique to generate candidates and count their supports; (2) COFI* proposes a new data structure that categorizes the itemsets, helping to handle patterns of arbitrary length; (3) COFI* finds the set of all frequent patterns by first finding the set of maximal patterns using a novel traversal approach and then generating the set of all frequent patterns; (4) COFI uses a depth-first strategy, while COFI* uses the leap-traversal approach introduced in their paper.

2.2 Static Data Mining

Kenji et al [62] modeled semi-structured data and patterns as labeled ordered trees and studied the problem of discovering all frequent tree-like patterns that have at least min-sup support in a given collection of semi-structured data. They presented an efficient pattern mining algorithm, FREQT, for discovering all frequent tree patterns from a large collection of labeled ordered trees.

Raedt and Kersting [58] identified some of the key concepts and techniques underlying probabilistic logic learning, explained the differences between the various approaches, and at the same time provided insight into some of the remaining challenges in probabilistic logic learning. The techniques were analyzed starting from a logical (or inductive logic programming) perspective. Furthermore, principles of both statistical learning and inductive logic programming (or multi-relational data mining) were employed for learning the parameters and structure of the probabilistic logics considered.

Vijay Pithadia [17] described how data mining and KDD technology can facilitate analysis of data in order to extract the important knowledge hidden inside it.

Salvatore [63] described the JAM (Java Agents for Meta-learning) system, a distributed, scalable, and portable agent-based data mining system that employs a general approach to scaling data mining applications called meta-learning. JAM provides a set of learning programs, implemented either as Java applets or applications, which compute models over data stored locally at a site. JAM also provides a set of meta-learning agents for combining multiple models that were learned at different sites, and it employs a special distribution mechanism which allows the migration of the derived models, or classifying agents, to other remote sites. One of JAM's target applications is fraud and intrusion detection in financial information systems, an important application area of Federal Information Services.

Jin and Agrawal [18] presented the design and an initial performance evaluation of a middleware enabling rapid development of parallel data mining applications, which can help exploit parallelism on both shared-memory and distributed-memory configurations. They studied parallel versions of several methods; in each of these methods, parallelization is achieved by dividing the data instances (or records, or transactions) among the nodes. The computation on each node involves reading the data instances in an arbitrary order, processing each data instance, and performing a local reduction.
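The divide-then-reduce pattern described above can be sketched as follows (a simplification of the general pattern, not Jin and Agrawal's middleware; the nodes are simulated sequentially here rather than run on real parallel hardware):

```python
from collections import Counter

def parallel_count(transactions, n_nodes):
    """Sketch of the data-parallel pattern: divide the data instances
    among n_nodes, let each node compute a local reduction (item counts),
    then merge the local results into the global counts."""
    # Divide data instances among the nodes (round-robin partitioning)
    partitions = [transactions[i::n_nodes] for i in range(n_nodes)]
    # Local reduction on each (simulated) node
    local_counts = [Counter(item for t in part for item in t)
                    for part in partitions]
    # Global merge of the local reductions
    total = Counter()
    for c in local_counts:
        total += c
    return total
```

Because the local reductions are independent, each partition could equally be processed by a separate thread, process, or machine before the final merge.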

Guo et al [60] showed that large data volumes and/or high dimensionality can cause serious efficiency problems: if a method does not scale well with the data size (n) and/or the dimensionality (d), its performance (in terms of execution time) can degrade rapidly as n or d increases.

Thearling [46] provided an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment, along with a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.

Daniel S. [61] described information mining operations and techniques and highlighted the applications that have proven to provide competitive advantages in many enterprises worldwide.

2.3 Dynamic Data Mining

Due to the continuous, unbounded, and high-speed characteristics of dynamic data, a huge amount of data accumulates in both offline and online data streams. Consequently, there is not enough time to rescan the whole database whenever an update occurs, as traditional data mining algorithms do, nor is there enough space to store all of the stream data for online processing.

Ganti et al [69] examined mining of data streams. A block evolution model is introduced in which a data set is updated periodically through insertions and deletions. In this model, the data set consists of a conceptually infinite sequence of data blocks D1, D2, ... that arrive at times 1, 2, ..., where each block contains a set of records. Some applications require mining all of the data encountered thus far (the unrestricted window scenario), while others require mining only the most recent part (the restricted window scenario) and updating the models accordingly. The authors highlight two challenges in mining evolving blocks of data: change detection and data mining model maintenance. In change detection, the differences between two data blocks are determined; for model maintenance, a data mining model must be maintained under insertions and deletions of blocks of data, according to a specified data span and block selection sequence.
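A minimal sketch of restricted-window model maintenance under this block evolution model (our own illustration with the simplest possible "model", per-item counts; the class name is illustrative) might look like this:

```python
from collections import Counter, deque

class WindowedItemCounts:
    """Sketch of restricted-window maintenance: item counts are kept
    over the w most recent blocks. When a new block arrives, the oldest
    block's contribution is deleted and the new block's is inserted,
    without rescanning any earlier data."""

    def __init__(self, w):
        self.w = w
        self.blocks = deque()   # per-block count contributions
        self.counts = Counter() # current model: counts over the window

    def add_block(self, block):
        block_counts = Counter(item for t in block for item in t)
        self.blocks.append(block_counts)
        self.counts += block_counts          # insertion of the new block
        if len(self.blocks) > self.w:
            self.counts -= self.blocks.popleft()  # deletion of expired block
```

The same insert/delete discipline applies to richer models (frequent itemsets, clusters), which is exactly where the maintenance challenge the authors highlight becomes non-trivial.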

Kargupta et al [70] developed the first ubiquitous data stream mining system, MobiMine, a client/server PDA-based distributed data stream mining application for stock market data. It should be pointed out that the mining component is located at the server side rather than on the PDA, and there are several interactions between the server and the PDA before the results are finally displayed on the PDA screen. This tendency to perform data mining at the server side has changed with the increase in the computational power of small devices.

Crespo and Weber [11] presented a methodology for dynamic data mining using fuzzy clustering that assigns static objects to dynamic classes. The changes they studied are movement, creation, and elimination of classes, and any combination of these.

Chung and McLeod [37] proposed a mining framework that supports the identification of useful patterns based on incremental data clustering. Focusing their attention on news stream mining, they presented a sophisticated incremental hierarchical document clustering algorithm using a neighborhood search.

Zaslavsky et al [16] discussed data stream mining and the importance of its applications; the proposed techniques have their roots in statistics and theoretical computer science. Data stream mining algorithms fall into two categories, data-based techniques and task-based techniques, and within these categories a number of clustering, classification, frequency counting, and time series analysis methods have been developed.

Babu et al [4] focused on the problem of query processing, specifically on how to define and evaluate continuous queries over data streams, addressing semantic issues as well as efficiency concerns. They specified a general and flexible architecture for query processing in the presence of data streams, used this basic architecture as a tool to clarify alternative semantics and processing techniques for continuous queries, and mapped out research topics in the area of query processing over data streams.

Reigrotzki et al [57] presented several process control-related methods applied in the context of monitoring and controlling data quality in financial databases. They showed that the quality control process can be considered a classical control loop, in which quality is measured via the application of quality tests that exploit data redundancy defined by meta-information or extracted from the data by statistical models. Appropriate processing and visualization of the test results enable human or automatic detection and diagnosis of data quality problems. Moreover, the model-based methods give insight into business-related information contained in the data. The methods have been applied to the data quality monitoring of a real financial database at a customer site, delivering business benefits such as improvements in modeling quality, a reduction in the number of modeling cycles, and better data understanding; these benefits in turn lead to financial savings.

2.4 Dynamic Association Rules

Mining association rules involves considerable memory and CPU costs. This is especially a problem for dynamic data, since the processing time is limited to one online scan. An association rule mining technique for this setting must therefore make a single scan of the data and use memory compactly.

The problem of maintaining discovered association rules was first addressed in [25]. The authors proposed an incremental mining algorithm, called FUP (Fast UPdate algorithm), for incrementally maintaining mined association rules. The FUP algorithm modifies the Apriori mining algorithm [30] and adopts the pruning techniques used in the DHP (Direct Hashing and Pruning) algorithm [72]. It first calculates large itemsets mainly from the newly inserted transactions and compares them with the previous large itemsets from the original database. According to the comparison results, FUP determines whether re-scanning the original database is needed, thus saving some of the time spent maintaining the association rules. Although the FUP algorithm can indeed improve mining performance for incrementally growing databases, the original database still needs to be rescanned when necessary.
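The core FUP decision can be sketched as follows (a heavily simplified illustration at the single-item level, not the full FUP algorithm; function and variable names are ours): supports are counted in the increment only, combined with the old counts for previously large itemsets, and a rescan is flagged only for itemsets that are frequent in the increment but whose support in the original database is unknown.

```python
from collections import Counter

def fup_update(old_large, old_n, increment, min_sup_ratio):
    """FUP-style sketch for single items.

    old_large: item -> support count in the original database (size old_n)
    increment: newly inserted transactions
    Returns (still_large, rescan_needed): updated large items, and items
    frequent in the increment whose original support is unknown, which
    would require rescanning the original database."""
    inc_counts = Counter(item for t in increment for item in t)
    new_n = old_n + len(increment)
    still_large, rescan_needed = {}, set()
    # Previously large items: only the increment needs to be counted
    for item, cnt in old_large.items():
        total = cnt + inc_counts.get(item, 0)
        if total / new_n >= min_sup_ratio:
            still_large[item] = total
    # Items newly frequent in the increment: original support unknown
    for item, cnt in inc_counts.items():
        if item not in old_large and cnt / len(increment) >= min_sup_ratio:
            rescan_needed.add(item)
    return still_large, rescan_needed
```

The saving comes from the first loop: for itemsets that were already large, the old counts are reused and only the new transactions are scanned.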

A more general algorithm, called FUP2, was proposed later in [52]; it can update the discovered association rules when transactions are added to, deleted from, or modified in the database, and it is efficient not only when the database grows but also when data are trimmed. However, in a data stream environment stream data are added continuously, and therefore, if we update the association rules too frequently, the cost of computation increases drastically.

Ayan et al. proposed the UWEP (Update With Early Pruning) method, which employs a dynamic look-ahead strategy to update the existing frequent itemsets, detecting and removing itemsets that are infrequent in the updated database [95].

Raghavan et al [22] introduced a dynamic data mining approach that periodically performs the data mining process on data updates during the current episode and uses the knowledge captured in the previous episode to produce data mining rules. They introduced the concept of locality along with definitions of emerged-large and declined-large itemsets, and they addressed some of the problems that current data mining techniques suffer from, such as database updates, accuracy of data mining results, gaining more knowledge, interpretation of the results, and performance; however, their work only presented a solution for data insertion.

Lee et al [71] proposed an algorithm, called DELI, which uses a sampling technique to estimate the difference between the old and new association rules. If the estimated difference is large enough, the algorithm signals the need for an update operation; otherwise, it takes the old rules as an approximation of the new rules. DELI considers the difference in association rules, but it does not consider the performance of incremental data mining algorithms on evolving data, which is especially the situation in data stream mining.
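The sampling idea can be illustrated with a minimal sketch (ours, much simpler than DELI, and at the single-item level): a random sample of the updated database estimates the new frequent items, and the size of the symmetric difference against the old result serves as the estimated change; a caller would trigger a full update only when this estimate exceeds some threshold.

```python
import random
from collections import Counter

def estimate_rule_difference(old_items, database, min_sup_ratio,
                             sample_size, seed=0):
    """DELI-style sketch: estimate, from a random sample of the updated
    database, how many frequent items appeared or disappeared relative
    to the old mining result."""
    rng = random.Random(seed)
    sample = [database[rng.randrange(len(database))]
              for _ in range(sample_size)]
    counts = Counter(item for t in sample for item in t)
    new_items = {i for i, c in counts.items()
                 if c / sample_size >= min_sup_ratio}
    # Symmetric difference: items frequent in only one of the two results
    return len(new_items ^ set(old_items))
```

Sampling trades accuracy for speed: the estimate can be computed without touching most of the database, which is the point of deferring full updates.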

Chang et al. proposed the NFUP method [56]. NFUP partitions the incremental database logically according to units of time interval (month, quarter, or year, for example) and progressively accumulates the occurrence count of each candidate according to the partitioning characteristics. The latest information resides in the last partition of the incremental database, and NFUP scans the partitions backward; that is, the last partition is scanned first and the first partition is scanned last.

Tzung-Pei Hong et al [39] proposed a new mining algorithm based on scanning the original database only once, to further reduce the need for rescanning original databases. Since rescanning the database requires much computation time, the maintenance cost can thus be reduced. The concept of pre-large itemsets was proposed to solve the problem of an itemset that is not large in the original database but is large in the newly inserted transactions. A lower support threshold and an upper support threshold are used to realize this concept. The upper support threshold is the same as that used in conventional mining algorithms: the support ratio of an itemset must exceed it for the itemset to be considered large. The lower support threshold, on the other hand, defines the lowest support ratio for an itemset to be treated as pre-large; an itemset with a support ratio below the lower threshold is considered small. Pre-large itemsets act like buffers in the incremental mining process and are used to reduce the movement of itemsets directly from large to small and vice versa.
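The two-threshold classification at the heart of the pre-large concept can be sketched as follows (a minimal illustration; whether the boundaries are inclusive or exclusive is our assumption):

```python
def classify_itemset(support_ratio, lower, upper):
    """Classify an itemset under the pre-large scheme:
    - 'large'     at or above the upper support threshold,
    - 'pre-large' between the lower and upper thresholds (the buffer zone),
    - 'small'     below the lower threshold."""
    if support_ratio >= upper:
        return "large"
    if support_ratio >= lower:
        return "pre-large"
    return "small"
```

Because an itemset must cross the whole buffer zone between the two thresholds before flipping between large and small, a bounded number of inserted transactions cannot silently turn a small itemset large, which is what delays the need for a rescan.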

Zheng et al [73] proposed a metric distance as a difference measure between sequential patterns and used a method, called TPD, to decide when to update the sequential patterns of stream data. The authors also suggested that some initial experiments be done to discover a suitable incremental ratio, which is then used to decide when it would be suitable to update the sequential patterns. The TPD method is only suitable for streams with little concept drift, that is, where the change in data distribution is relatively small.

Agrawal et al [50] studied the option of expressing association rule mining algorithms in the form of SQL queries. They considered four options in SQL-92 and six options in SQL enhanced with object-relational extensions (SQL-OR). Their evaluation of the different architectural alternatives shows that, from a performance perspective, the Cache-Mine option is superior, although the performance of the SQL-OR option is within a factor of two. Both the Cache-Mine and SQL-OR approaches incur a higher storage penalty than the loose-coupling approach, which in turn performed a factor of 3 to 4 worse than Cache-Mine.

Jiang et al [2] discussed general issues that need to be considered by all association rule mining algorithms for data streams. The first issue is which parts of the data streams are selected for association rule mining; the data processing model must provide a way to extract transactions for mining from the overall streams. The next fundamental issue is how to optimize the memory space consumed while running the mining algorithm; this includes deciding how the information should be collected from the data streams and choosing a compact in-memory data structure that allows the information to be stored, updated, and retrieved efficiently. Another fundamental issue is choosing the right type of mining algorithm. They discussed the issues that need to be considered to generate and maintain frequent itemsets and association rules in data streams, and also described application-dependent issues.

Hong et al [40] proposed an algorithm that maintains generalized association rules for deleted data based on the concept of pre-large itemsets [39]. The pre-large concept is used to reduce the need for rescanning the original database (no rescan is needed until a certain number of records have been deleted) and thus to save maintenance cost and time. The proposed algorithm can efficiently and effectively maintain association rules with a taxonomy based on the pre-large concept; however, it still eventually requires a database rescan.

Lee et al. proposed the SWF approach [94]. SWF partitions databases into several partitions, and applies a filtering threshold in each partition to generate candidate itemsets.