Data Mining And Knowledge Discovery In Databases Computer Science Essay

Data mining, together with knowledge discovery in databases (KDD), is widely recognized as an important research area with broad applications. This report reviews the progress of data mining techniques since the 1990s; surveys the literature on data mining methods and algorithms, including characterization, classification, clustering, association, evolution, and data visualization; and discusses the challenges in data mining, especially for multimedia data. It also covers some real-world applications and the current and future research directions in the multimedia field.


Data mining is the process of extracting previously unknown knowledge from massive data sets by detecting interesting patterns in them. [Multimedia paper] Knowledge discovery in databases (KDD) is the higher-level process of acquiring information through data mining and distilling it into knowledge through interpretation and integration with existing knowledge. (Paper 3) Data mining thus refers to one particular step in the process, whereas KDD refers to the overall process of discovering useful knowledge from the given data. Data mining is a demanding task that has attracted many researchers and developers since the 1990s, and it has made good progress over the past several years.


Data-mining systems originated in the 1980s as research-driven tools focused on single tasks. These tasks included building classifiers with a decision-tree tool such as C4.5 [2], finding clusters in data with a tool such as AutoClass [3], and visualizing data with Alfred Inselberg's parallel-coordinate approach [4]. Each of these tools addressed a generic data-analysis problem, and the intended user ran into technical complications, because using more than one tool on the same data set required significant data and metadata transformation. To address these complications, data-mining vendors developed second-generation data-mining systems, called suites, around 1995. Suites such as SPSS's Clementine, IBM's Intelligent Miner, Silicon Graphics' MineSet, and the SAS Institute's Enterprise Miner let the user perform several discovery tasks, typically classification, clustering, and visualization, and also supported data transformation. According to Herb Edelstein [6], by 1998 data mining had shown a 100% improvement over 1997. The third generation, consisting of vertical data-mining-based applications and solutions, was developed in the late 1990s. These systems were oriented toward solving a specific business problem, such as detecting credit card fraud or predicting cell phone customer attrition. Their interfaces were aimed at the business user and hid all the data-mining complexity. Newer areas, namely text mining, multimedia mining, and Web mining, have grown rapidly in recent years.


Data mining includes many different methods for analyzing and classifying data. The most common are characterization, classification, cluster analysis, Bayesian inference, and inductive learning. Characterization provides a general, condensed description of the data, which includes visualization of the data and basic statistical functions such as the average or the deviation. (5853) Cluster analysis is the process of partitioning data objects into meaningful groups, or clusters, by identifying and analyzing patterns based on numerical measures or statistical data; its inputs include the raw data and information from a data dictionary. Applications of clustering include data mining, document retrieval, image segmentation, and pattern classification. (1doc) The Bayesian inference method attempts to adjust the classification so as to maximize the conditional probability that the grouping matches the actual data structure, given the available data. (1doc) Inductive learning assigns an object to one of the existing classes based on its attributes. The ID3 algorithm is a well-known example of this approach.


Many successful algorithms have been developed in the data mining field. Some of those that have been developed in recent years and have been widely adopted are described below.

The k-means algorithm:

The k-means algorithm is a simple iterative method that partitions a given dataset into a user-specified number of clusters, k. The algorithm has been discovered independently by several researchers across different disciplines, most notably Lloyd (1957, 1982) [53], Forgy (1965), Friedman and Rubin (1967), and MacQueen (1967). It operates on a set of d-dimensional vectors and begins by picking k points as the initial cluster representatives, known as "centroids". It then alternates between two steps until the assignments stop changing. In the data assignment step, each data point is assigned to its closest centroid, which results in a partitioning of the data. In the relocation of "means" step, each cluster representative is relocated to the center (mean) of the data points assigned to it. The k-means algorithm suffers from several problems; for example, it is sensitive to the presence of outliers, since the "mean" is not a robust statistic.
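The two steps can be illustrated with a minimal sketch in plain Python. The function names (`sq_dist`, `mean`, `kmeans`), the random choice of initial centroids, and the stopping rule are assumptions made for illustration; they are not taken from any of the papers cited above.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two d-dimensional points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means: returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Data assignment: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Relocation of means: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # nothing moved, so the partition is stable
        centroids = new_centroids
    return centroids, clusters

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
```

The outlier problem is visible in the `mean` helper alone: the mean of the one-dimensional points 1, 2, 3 and 100 is 26.5, which is far from every point in the group.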

The Apriori algorithm:

The Apriori algorithm is one of the major developments in the data mining field. Many of the pattern-finding algorithms frequently used in data mining, such as decision trees, classification rules, and clustering techniques, were developed by the machine learning research community. Frequent pattern and association rule mining is one of the few exceptions to this tradition. Apriori is a seminal algorithm that finds the frequent itemsets in a transaction dataset using candidate generation and then derives association rules from them (10alg). Its introduction boosted data mining research, and its impact has been tremendous. The algorithm is quite simple and easy to implement. The most outstanding improvement over Apriori is a method called FP-growth (frequent pattern growth), which succeeded in eliminating candidate generation. (10 alg)
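A minimal sketch of level-wise candidate generation is given below. It assumes that support is measured as an absolute transaction count; the shopping-basket data and the names `apriori` and `confidence` are made up for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
        # and keep only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Support counting: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

def confidence(supports, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent (both frequent)."""
    a, b = frozenset(antecedent), frozenset(consequent)
    return supports[a | b] / supports[a]

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
supports = apriori(baskets, min_support=3)
rule_conf = confidence(supports, {"diapers"}, {"beer"})  # 3 / 4
```

The pruning test inside the join is what the Apriori name refers to: an itemset can be frequent only if all of its subsets are frequent, so most candidates are discarded before any transaction is scanned.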

The PageRank algorithm:

PageRank is a search ranking algorithm that uses the hyperlinks on the Web. It was presented by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April 1998. The search engine Google, which has been very successful, is based on this algorithm, and today every major search engine has its own hyperlink-based ranking method. PageRank produces a static ranking of Web pages, in the sense that a PageRank value is computed for each page offline and does not depend on search queries. The algorithm relies on the democratic nature of the Web, using its vast link structure as an indicator of an individual page's quality.
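The offline computation can be sketched as a simple power iteration over a link graph. The damping factor of 0.85, the uniform treatment of pages without out-links, and the four-page example graph are assumptions made here for illustration; the text above does not specify them.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Illustrative power iteration; links maps a page to the pages it links to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank along each out-link.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        converged = sum(abs(new[p] - rank[p]) for p in pages) < tol
        rank = new
        if converged:
            break
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
best = max(ranks, key=ranks.get)  # "C", the page with the most in-links
```

Because the ranks depend only on the link structure, they can be computed once and reused for every query, which is what makes the ranking static.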

The AdaBoost algorithm:

Ensemble learning deals with data mining methods that employ multiple learners to solve a problem. The generalization ability of an ensemble is usually better than that of a single learner, which makes ensemble methods attractive. The AdaBoost algorithm, proposed by Yoav Freund and Robert Schapire, is one of the most important ensemble methods, since it has a solid theoretical foundation, very accurate prediction, great simplicity, and wide and successful applications. Boosting has become the most important "family" of ensemble methods. AdaBoost has given rise to abundant research on the theoretical aspects of ensemble methods, which can easily be found in the machine learning and statistics literature.
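How several weak learners combine into a stronger one can be shown with a minimal sketch of AdaBoost for labels +1 and -1. The choice of one-feature threshold rules ("decision stumps") as the weak learner, the function names, and the toy data are assumptions made for illustration.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Weak learner: predicts polarity if x[feature] <= threshold, else -polarity."""
    return polarity if x[feature] <= threshold else -polarity

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, f, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    """Return a list of (alpha, feature, threshold, polarity) weak learners."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, f, thr, pol = train_stump(X, y, w)
        err = max(err, 1e-12)  # guard against log(0) for a perfect stump
        if err >= 0.5:
            break  # the weak learner is no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Increase the weight of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, thr, pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(a * stump_predict(x, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1

X = [(i,) for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_stump = adaboost(X, y, rounds=1)
three_stumps = adaboost(X, y, rounds=3)
```

On this toy data no single threshold separates the classes, so the best stump alone misclassifies three of the ten points, while the weighted vote of three stumps classifies all ten training points correctly.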


Now that data mining is becoming a mature technology, it is important that appropriate standards be established for its various aspects. As the field continues to grow, standardization cannot be avoided. It remains to be examined which process models could be applied to model data mining. Standardization will enable standard methods and procedures to be developed, so that the entire data mining process becomes easier for different types of users. (Paper10) The two major challenges in standardizing data mining are agreeing on a common standard for cleaning, transforming, and preparing data for data mining, and agreeing on a common set of Web services for working with remote and distributed data. To overcome these challenges, standards such as data mining metadata standards, process standards, and Web standards are being developed. (paper 13)

Multimedia data mining:

{multimedia paper,11 paper}

Multimedia data mining is the mining and analysis of various types of data, including animation, images, audio, and video, to extract information and knowledge from large multimedia databases. Because the fields are closely related, multimedia data mining also includes hypertext and hypermedia mining, which overlap with text mining. A general characteristic of many data mining applications, including multimedia ones, is that specific features of the data are captured as feature vectors, or tuples in a table or relation, and the tuples are then mined. In multimedia data mining applications, feature extraction is used to convert the raw multimedia data into relational or tabular form, and the resulting tuples or rows are then mined.
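As a rough illustration of this conversion, the sketch below turns a small grayscale image into one tuple. The chosen features (mean intensity and a four-bin intensity histogram), the function name `extract_features`, and the 2-by-2 example image are hypothetical and stand in for whatever features a real system would extract.

```python
def extract_features(image_id, pixels, bins=4):
    """Turn a grayscale image (rows of 0-255 integers) into one flat tuple."""
    flat = [v for row in pixels for v in row]
    hist = [0] * bins
    for v in flat:
        hist[min(v * bins // 256, bins - 1)] += 1
    mean_intensity = sum(flat) / len(flat)
    # One row of the relation: identifier, mean intensity, normalized histogram.
    return (image_id, mean_intensity, *[h / len(flat) for h in hist])

# Each image becomes one row; the resulting table is what gets mined.
images = {"img1": [[0, 64], [128, 255]], "img2": [[10, 20], [30, 40]]}
table = [extract_features(name, px) for name, px in sorted(images.items())]
```

Once every image is a fixed-length tuple, ordinary tabular methods such as clustering or classification can be applied to the table without reference to the raw pixels.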


Video and audio data mining, like other multimedia data mining, often involves a preliminary feature extraction step in which the pertinent data is formed into a relation of tuples, or possibly a time series of tuples, with each tuple describing selected features of a "frame". Peano Count Trees (P-trees) provide a common structure for such multimedia data; the P-tree data structure is designed for exactly this kind of data mining setting. P-trees provide a lossless, compressed, data-mining-ready representation of the relational data set [7]. Given a relational table (with ordered tuples or rows), the data can be organized in different formats. BSQ, BIL, and BIP are three typical formats. The Band Sequential (BSQ) format is similar to the relational format: each attribute is stored as a separate file, and each individual band uses the same tuple ordering. Thematic Mapper (TM) satellite images are in BSQ format. For images, the Band Interleaved by Line (BIL) format stores the data in line-major order, i.e., the first row of all bands, followed by the second row of all bands, and so on. SPOT images, which come from French satellite platforms, are in BIL format. Band Interleaved by Pixel (BIP) is a pixel-major format. Standard TIFF images are in BIP format.

A generalization of the BSQ format, called bit Sequential (bSQ), has been proposed to organize any relational data set with numerical values [7]. Each attribute is split into separate files, one for each bit position. There are several reasons for using the bSQ format. First, different bits make different contributions to the values; in some applications, the high-order bits alone provide the necessary information. Second, the bSQ format facilitates the representation of a precision hierarchy. Third, the bSQ format facilitates compression. P-trees are essentially quadrant-wise, Peano-order, run-length-compressed representations of each bSQ file. Fast P-tree operations, especially the fast AND operation, make efficient data mining possible.
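The bit-level split and the quadrant-wise counting idea can be sketched as follows. This is only an illustration under stated assumptions: attribute values are taken to be 8-bit unsigned integers, quadrants are visited in upper-left, upper-right, lower-left, lower-right order, and the nested-tuple tree layout and all function names are made up here rather than taken from [7].

```python
def to_bit_planes(values, bits=8):
    """Split integers into one bit list per position, high-order bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(bits - 1, -1, -1)]

def from_bit_planes(planes):
    """Rebuild the original integers, showing that the split is lossless."""
    bits = len(planes)
    return [sum(planes[i][j] << (bits - 1 - i) for i in range(bits))
            for j in range(len(planes[0]))]

def count_tree(bitmap):
    """Quadrant-wise count tree for a square 2^n x 2^n bitmap (list of rows).

    A pure quadrant (all 0s or all 1s) is stored as its 1-bit count only;
    a mixed quadrant is stored as (count, [four child trees]).
    """
    size = len(bitmap)
    ones = sum(map(sum, bitmap))
    if ones == 0 or ones == size * size:
        return ones
    half = size // 2
    quadrants = [
        [row[:half] for row in bitmap[:half]],  # upper-left
        [row[half:] for row in bitmap[:half]],  # upper-right
        [row[:half] for row in bitmap[half:]],  # lower-left
        [row[half:] for row in bitmap[half:]],  # lower-right
    ]
    return (ones, [count_tree(q) for q in quadrants])

band = [5, 3, 255, 0]
planes = to_bit_planes(band)  # planes[0] holds the high-order bits
bitmap = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [1, 1, 1, 0],
          [1, 1, 0, 1]]
tree = count_tree(bitmap)
```

In the example, three of the four quadrants are pure and collapse to a single count, which is where the compression comes from; only the mixed lower-right quadrant is expanded further.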


MultiMediaMiner is a current example of a multimedia data mining system; it is a system prototype. The system consists of four major components, among them an image excavator, which extracts images and videos from multimedia sources. Systems such as IBM's Query by Image Content (QBIC) and MIT's Photobook extract image features such as intensities, colors, hue histograms, and quantitative measures of texture. Once these features have been extracted, each image is represented as a point in a multidimensional space whose coordinate axes correspond to the features.

Future challenges in multimedia data mining:

A developing area in multimedia data mining is audio data mining, in particular the mining of music. The idea is to use audio signals either to indicate the patterns in the data or to represent the features of data mining results. Such mining should make it possible not only to summarize the melodies in a piece of music, but also to summarize its style in terms of tone, tempo, or the major musical instruments played (Han & Kamber, 2001; Zaiane, Han, Li, & Hou, 1998; Zaiane, Han, & Zhu, 2000).


Web mining is one of the most promising areas in data mining, because the Internet and the WWW are dynamic sources of information. Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the WWW (Etzioni, 1996). The main tasks that comprise Web mining include retrieving Web documents, selecting and processing Web information, discovering patterns within and across sites, and analyzing the patterns found (Garofalakis, Rastogi, Seshadri & Shim, 1999; Kosala & Blockeel, 2000; Han, Zaiane, Chee, & Chiang, 2000). Web mining can be categorized into three separate areas: Web-content mining, Web-structure mining, and Web-usage mining.

Web-content mining. Web-content mining is the process of extracting knowledge from the content of documents or their descriptions. This includes the mining of Web text documents, which is a form of resource discovery based on the indexing of concepts, sometimes using agent-based technology.

Web-structure mining. Instead of looking at the text and data on the pages themselves, Web-structure mining has as its goal the mining of knowledge from the structure of websites. More specifically, it examines the structures that exist between documents on a website, such as hyperlinks and other linkages. For instance, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness, or perhaps the variety, of the topics covered in the document.

Web-usage mining. Yet another major area in the broad spectrum of Web mining is Web-usage mining. Rather than looking at the content of pages or the underlying structure, Web-usage mining focuses on Web user behavior or, more specifically, on modeling and predicting how a user will use and interact with the Web.
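The link-based indicators used in Web-structure mining can be illustrated with a small sketch that counts in-links and out-links per page. The site map, the page names, and the function `link_degrees` are hypothetical.

```python
from collections import Counter

def link_degrees(links):
    """Map each page to (in-links, out-links); links maps page -> linked pages."""
    # Repeated links from one page to the same target are counted once.
    out_degree = {page: len(set(targets)) for page, targets in links.items()}
    in_degree = Counter(t for targets in links.values() for t in set(targets))
    pages = set(links) | set(in_degree)
    return {p: (in_degree.get(p, 0), out_degree.get(p, 0)) for p in pages}

site = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["b.html"],
    "b.html": [],
}
degrees = link_degrees(site)
```

In this toy site, b.html has the most incoming links, which under the reading above suggests popularity, while index.html has the most outgoing links, which suggests breadth of topics.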