Methods for Extracting Information from Unstructured Data


In this paper I present different methods for mining multimedia content from unstructured data. Today, most users rely on information available on the web for different purposes. Because most of this information is available only as HTML documents, pictures, sound and video, many techniques have been defined that allow information from the web to be extracted automatically. Analyzing these multimedia files can reveal a lot of useful information for users. Extracting data from the Internet is more than an extension of data mining, because it is an effort that draws on computer graphics, data retrieval, multimedia mining, artificial intelligence, XML and databases.

Keywords: data mining, information extraction, multimedia mining, text mining, image mining, audio mining, video mining, wrapper induction.


The data on the web that is understandable only to humans can be labeled and classified so that intelligent agents can extract information from it [7]. These agents can then interact more efficiently with the data on the web and thus make better searches, schedule our appointments [8], etc.

This can be achieved through accurate semantic classification of the human-readable data on the World Wide Web. The labeling can be done by algorithms that exploit the DOM tree of a web page. We can use the tree edit distance method to find the best mapping between the DOM trees of two examples. This allows marking and extracting only the parts that match between the two examples, removing data that is specific to only one of them.
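As an illustration of comparing two DOM trees, the following is a minimal sketch of a restricted, top-down variant of tree edit distance, in which a root can be relabeled and whole subtrees can be inserted or deleted. It is not the general tree edit distance algorithm, and the Node class, the unit costs and the two sample pages are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A simplified DOM node: a tag name plus ordered children (hypothetical)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down edit distance: relabel roots, insert or delete whole subtrees."""
    cost = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    # dp[i][j]: cheapest way to turn the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),  # delete a subtree
                dp[i][j - 1] + size(b.children[j - 1]),  # insert a subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return cost + dp[m][n]


# Two invented pages: the second has one more list item and an extra paragraph.
page_a = Node("div", [Node("h1"), Node("ul", [Node("li"), Node("li")])])
page_b = Node("div", [Node("h1"),
                      Node("ul", [Node("li"), Node("li"), Node("li")]),
                      Node("p")])
distance = tree_distance(page_a, page_b)
print(distance)
```

Subtrees matched at zero cost are the parts shared by both pages, while the inserted or deleted subtrees correspond to data specific to only one of them.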

Normally, this task falls to the content providers, because they can access the relational data behind the pages and are able to modify the published content. Several tools, such as browsers and ontology search engines, have been created with the sole goal of making it easier for content providers to add semantic labels to their web pages. However, they have not been widely adopted by content providers.

The best way to add semantic meaning to web pages is to provide a tool that allows users to create and use their own labels for existing content. In particular, it is necessary to make the extraction of semantic content available to non-technical users, changing current user interfaces so that they can add semantic meaning to web pages.

Information Extraction

The approach to pattern induction and matching is one case of the broader task of information extraction. This field has received a lot of attention recently. Information extraction covers the retrieval of data from semi-structured and unstructured documents on the Internet. Many approaches, using supervised and unsupervised learning, have been tried, with various levels of success.

Fig (1)

The subfield of information extraction that deals with documents on the Internet is called Wrapper Induction, and is defined by Kushmerick [1] as "the task of learning a procedure for extracting tuples from a particular information source from examples provided by the user". Kushmerick defined the HLRT wrapper class in the WIEN system. This wrapper was limited to locating information delimited by four types of tags: the head, left, right and tail delimiters. Because of this restriction, it could successfully wrap only 48% of HTML data on the Internet.
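To make the role of the four delimiters concrete, here is a minimal sketch of how a wrapper in the head-left-right-tail style could be executed once its delimiters are known. It does not show how WIEN learns the delimiters from examples; the function name run_hlrt, the sample page and the chosen delimiters are all assumptions made for illustration.

```python
from typing import List, Tuple


def run_hlrt(page: str, head: str, tail: str,
             pairs: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract tuples using a head, a tail and one (left, right) pair per attribute."""
    rows: List[Tuple[str, ...]] = []
    start = page.find(head)
    if start == -1 or not pairs:
        return rows
    pos = start + len(head)  # skip everything up to and including the head
    while True:
        next_left = page.find(pairs[0][0], pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple starts before the tail delimiter.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in pairs:
            begin = page.find(left, pos)
            if begin == -1:
                return rows
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return rows
            row.append(page[begin:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Invented sample page listing country names and dialing codes.
PAGE = ("<HTML><TITLE>Country Codes</TITLE><BODY><B>Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></BODY></HTML>")
rows = run_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
```

The head delimiter keeps the bold page title from being mistaken for a country, and the tail delimiter keeps the bold footer out of the result. A page whose data is not bracketed by fixed strings in this way cannot be handled, which is the restriction mentioned above.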

A similar approach, the STALKER system [2], tries to extract some of the hierarchical structure of HTML files along with their semantic data. It uses the Embedded Catalog (EC) formalism, which is made of sets of k-tuples in which each element is either information relevant to the user or a reference to other k-tuples. The EC description of a page is therefore a hierarchy, similar to the subject-predicate-object structure used in RDF.

Other approaches to information extraction use probabilistic models. Hidden Markov Models (HMMs) make it possible to learn both the probabilities and the structure of a model that represents the information in various document types [3]. These methods treat the document as a sequence of objects, first parsing it into tokens such as HTML tags and text. The structure of the HMM can be either hand-crafted or learned from a given set of training documents by stochastic optimization. These methods were very successful in extracting information from semi-structured documents such as academic papers, but they were less successful on HTML documents.
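Once an HMM is available, labeling a token sequence amounts to finding the most probable sequence of hidden states. The sketch below uses the standard Viterbi algorithm for this. The two states, the three token classes and all probabilities are hand-crafted assumptions for illustration and are not taken from [3].

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observed tokens."""
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for s at this position.
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical model: "target" marks the tokens to extract (here, numbers).
STATES = ["background", "target"]
START = {"background": 0.9, "target": 0.1}
TRANS = {
    "background": {"background": 0.7, "target": 0.3},
    "target": {"background": 0.6, "target": 0.4},
}
EMIT = {
    "background": {"tag": 0.5, "text": 0.45, "number": 0.05},
    "target": {"tag": 0.05, "text": 0.15, "number": 0.8},
}

# A document already tokenized into coarse token classes.
tokens = ["tag", "text", "number", "tag"]
labels = viterbi(tokens, STATES, START, TRANS, EMIT)
print(list(zip(tokens, labels)))
```

For longer documents the products of probabilities become very small, so a practical version would usually work with sums of log probabilities instead.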

Some models learn to categorize data by assuming that elements which are close together in a hierarchy are likely to be classified in the same way [4]. For example, the directory tree of a web site is useful for understanding which documents are related. If most documents in a certain directory sub-tree have been categorized in a certain way, then new documents appearing in nearby sub-trees are more likely to be categorized in the same way.
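The idea can be sketched as a simple majority vote over already labeled documents in the nearest enclosing directory. This is only a rough stand-in for the learned models in [4]; the function predict_label, the paths and the labels are assumptions made for this example.

```python
import posixpath
from collections import Counter


def predict_label(path, labelled):
    """Guess a label from labelled documents in the nearest enclosing directory."""
    directory = posixpath.dirname(path)
    while True:
        prefix = directory.rstrip("/") + "/"
        votes = Counter(label for doc, label in labelled.items()
                        if doc.startswith(prefix))
        if votes:
            return votes.most_common(1)[0][0]
        if directory in ("", "/"):
            return None  # nothing labelled anywhere on the site
        directory = posixpath.dirname(directory)  # widen to the parent sub-tree


# Invented site with a few documents that were categorized earlier.
LABELLED = {
    "/sports/football/a.html": "sports",
    "/sports/football/b.html": "sports",
    "/sports/tennis/c.html": "sports",
    "/news/politics/d.html": "politics",
}

print(predict_label("/sports/football/new.html", LABELLED))
print(predict_label("/news/economy/e.html", LABELLED))
```

The second call finds no labeled document in its own directory, so the search widens to the parent directory and uses the label found there.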

Probabilistic Context-Free Grammars (PCFGs) are a technique for semantic tagging of semi-structured data [5]. When they are used to extract semantic content from English sentences, a probabilistic model is learned by parsing labeled examples and recording how often each context-free rule occurs in the training data. This model is then used to label new sentences by finding the most probable set of rules that could have generated the sentence. With semantic tags, a PCFG can be used to tag phrases with semantic meaning, which is the first step in information extraction.

Another example, an interactive system for pattern learning on various types of documents, is LAPIS [6]. This system provides an interactive interface in which users identify examples relevant to a pattern by highlighting them. Patterns are written in a language called text constraints, which has operators such as before, after, contains and starts-with. By using a pre-defined library of parsers that tokenize and tag the document, users can create patterns of arbitrary complexity, or let the system infer them from the given examples. This inference is performed by constructing a dictionary of region sets, which describe the areas of the document that match certain parts recognized by the parsers. By analyzing the intersections and recurrences of these region sets, LAPIS extracts the structured text. The result of matching these patterns is then displayed to the user, who can perform tasks such as editing and finding important content.

Information Extraction Methods

Multimedia data includes text and images (which are still media) and audio and video (which are continuous media). The issues raised by still and continuous media are different, and here we consider the mining of each of these multimedia data types.

Data mining has an impact on the functions of multimedia database systems. For example, query processing has to be adapted to handle mining queries if there is to be tight integration between the data miner and the database system. This in turn affects storage strategies and the data model. Today, mining tools work exclusively on relational databases; if object-oriented databases are used for multimedia data modeling, mining tools have to be developed to handle them.

Data mining tools model data as a collection of similar but independent entities, and their goal is to search for patterns common to those entities. Fitting multimedia into this picture is very hard. Pictures and videos of different objects have something in common, since they all display objects, but without a clear structure it is difficult to relate multimedia mining to data mining. Multimedia provides a lot of data for each entity, but not the same data for each entity.

Another difference between multimedia mining and structured data mining is time, because multimedia often captures an entity that changes over time. Audio, video and text are ordered, and they have no meaning without their sequence. Multimedia is also complex: as the sequence progresses, the represented concept may change. This is especially important for video, where objects may move.

Text Mining

Most information is in text form: data on the web, electronic books, etc. The biggest problem with text data is that it is not structured like relational data. In some cases, text is structured or semi-structured [9]. An example of semi-structured data is an article with structured fields such as title, author and abstract, followed by unstructured paragraphs.

Text mining is about extracting patterns and previously unknown associations from data in text form. The difference between text mining and information retrieval is the same as the difference between data mining and query processing. Query processing and information retrieval look for specific data items, whereas mining looks for higher-level concepts that span many items. The newest information retrieval and text processing tools find associations between words and paragraphs, so they can add semantic meaning to this content.

Just as we rarely hear about data mining tools for data in object-oriented databases, current mining tools cannot be applied directly to text data. The current directions in mining unstructured data include the following:

Extract data and metadata from unstructured databases by using tagging techniques, store that data in structured databases, and apply data mining tools to the structured databases.

Integrate data mining techniques with information retrieval tools so that appropriate data mining tools can be developed for unstructured databases.

When text data is converted into a relational database, care has to be taken not to lose critical information. If the data is poor, the mining process will not be efficient and will not yield useful results. First, a warehouse has to be created before mining the converted database. This is essentially a relational database that holds the essential data from the text. It means that a transformer is required that takes a text corpus as input and outputs tables, for example by extracting the keywords from the text.

In a text database with several articles, it is possible to create a warehouse with tables that contain the following attributes: author, date, publisher, title and keywords. The keywords differ from article to article, and the job of the data miner is to find associations between them.
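The transformer and the association step can be sketched as follows. This is a minimal illustration, assuming that keywords are simply the most frequent words that are not stop words and that an association is a pair of keywords appearing together in the same article. The articles, the stop word list and the function names are invented for this example.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}

# Invented corpus; a real warehouse would also carry date and publisher.
ARTICLES = [
    {"author": "Author A", "title": "On text",
     "text": "Mining text data: text mining finds patterns in text data."},
    {"author": "Author B", "title": "On images",
     "text": "Image mining and image data: mining image data."},
    {"author": "Author C", "title": "On video",
     "text": "Video retrieval indexes video clips for video retrieval."},
]


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop words (ties broken alphabetically)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    return sorted(counts, key=lambda w: (-counts[w], w))[:top_n]


def build_warehouse(articles):
    """Transform the text corpus into table rows with a keywords attribute."""
    return [{"author": a["author"], "title": a["title"],
             "keywords": extract_keywords(a["text"])} for a in articles]


def keyword_associations(rows):
    """Count how often each pair of keywords occurs in the same article."""
    pairs = Counter()
    for row in rows:
        pairs.update(combinations(sorted(row["keywords"]), 2))
    return pairs


warehouse = build_warehouse(ARTICLES)
associations = keyword_associations(warehouse)
print(associations.most_common(1))
```

Reducing each article to three keywords discards most of its content, which is exactly the loss of information that the conversion step has to guard against.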

A big effort has been made to augment information retrieval systems so that they can perform text mining. This is a by-product of attempts to improve information extraction. Many companies have produced products that identify frequent concepts in documents as a means of organizing documents and improving information extraction. This information is very useful in its own right, and not simply as an aid to information extraction.

Another approach, mining the text directly, has been used on problems of text classification and text clustering [10]. There are several examples, and groups have competed in solving text mining problems centered on a corpus in which documents have been classified into topics. The direct approach has proven effective for classification and clustering. Attempts to obtain other types of data mining results directly from unstructured data have not been successful. Approaches that treat documents as sets of words or phrases lose too much information and produce many meaningless results.

Image Mining

Text mining is still at an early stage, and image mining is even less mature. Image processing is widely used in many applications, such as medical imaging for detecting cancer, satellite image processing for space applications, hyper-spectral imaging, etc. Images include many entities such as maps, geological structures, biological structures and others [11]. Image processing deals with areas such as detecting abnormal patterns that deviate from the norm, retrieving images by content, and pattern matching.

While image processing focuses on detecting abnormal patterns and retrieving images, image mining is about finding unusual patterns. It deals with making associations between different images in image databases.

The first attempt at image mining was in 1977, and the plan was to extract metadata from images and carry out mining on the metadata. This was essentially mining metadata in relational databases, but later it was discovered that images can be mined directly. In this case the challenge is to determine which type of mining outcome is most suitable: whether to mine for associations between images, cluster images, classify images or detect unusual patterns, or to mine a sequence of images and find out whether there are any unusual changes. However, the mining tools do not tell why the changes are unusual.

Detecting unusual patterns is not the only outcome of image mining; attempts have also been made to identify recurring themes in images, both at the level of raw images and at the level of higher-level concepts.

However, this is still just the beginning. More research on image mining is required to check whether data mining techniques can be used to classify, cluster and associate images. Image mining is a topic with applications in numerous domains, including space, medical and geological images.

Audio Mining

Audio and video are continuous media types, so the techniques for processing and mining audio and video information are similar. Audio can take different forms, such as radio, speech or spoken language [12]. TV news also contains audio, integrated with video and possibly with text for titles or other information.

To mine audio data, it can be converted into text with speech recognition software, after which techniques such as keyword extraction are applied and the resulting text data is mined. Audio data can also be mined directly by using audio information processing techniques and then mining selected audio data.

Video Mining

Video mining is even more complicated than image, audio or text mining. We can see video as a collection of moving images. It is the subject of a lot of research. Important areas include developing query and retrieval techniques for video databases, including video indexing, query languages and optimization strategies. Unlike image or text mining, there is no clear picture of what video mining is. Video clips can be examined to find what they have in common, or to find unusual patterns within them [13]. The first step toward successful video mining is good image mining.

To be consistent in the terminology, video mining means finding correlations and previously unknown patterns in video databases. When one or more video clips are analyzed, we can identify unusual behavior.

If an object appears repeatedly in a video, this suggests that it is significant [14]. When the text associated with a video is captured and linked to the video data, it is possible to mine the text, but this time make use of the video data as well.

There is not much information on analyzing video data. Converting the video mining problem into a text mining problem is reasonably well understood. However, mining video data directly, and knowing what to mine, is a big challenge. Direct video mining is becoming very important with the emergence of the web.

Work has been done on categorizing video based on characteristics of the video itself, rather than the associated text. It is possible to identify features based on cinematic principles (length of scenes, scene changes, etc.) [15] and use them as classification input. Different approaches to summarization have been suggested, and they involve key frames or scene categorization.
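Scene lengths and scene changes can be estimated without any associated text. The following minimal sketch detects a scene change whenever the gray-level histograms of two consecutive frames differ strongly, and reports the resulting scene lengths. It is not the method of [15]; the representation of a frame as a flat list of gray values, the number of bins and the threshold are assumptions for illustration.

```python
def histogram(frame, bins=4):
    """Count the pixels of a frame (gray values 0..255) in equal-width bins."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return counts


def scene_lengths(frames, threshold=0.5):
    """Split a frame sequence at large histogram changes; return scene lengths."""
    if not frames:
        return []
    lengths, current = [], 1
    for prev, cur in zip(frames, frames[1:]):
        h_prev, h_cur = histogram(prev), histogram(cur)
        # Normalized histogram difference: 0.0 (identical) to 1.0 (disjoint).
        diff = sum(abs(x - y) for x, y in zip(h_prev, h_cur)) / (2 * len(cur))
        if diff > threshold:
            lengths.append(current)  # a scene change closes the current scene
            current = 1
        else:
            current += 1
    lengths.append(current)
    return lengths


# Synthetic clip: three dark frames, two bright frames, four dark frames.
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 230]
clip = [dark] * 3 + [bright] * 2 + [dark] * 4

lengths = scene_lengths(clip)
features = {"scene_changes": len(lengths) - 1,
            "mean_scene_length": sum(lengths) / len(lengths)}
print(lengths, features)
```

The number of scene changes and the mean scene length could then be passed to any classifier as input features.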

As with text mining, a lot of work in video mining has gone into video retrieval [16]. Further, it is necessary to understand what types of knowledge need to be gained from video mining. In some cases this is straightforward, but a wide set of applications for video mining has to be identified before research in this area can take the next step forward.

Mining Different Data Types

Previously I described the mining of individual data types: text, images, audio and video. However, mining multimedia requires mining combinations of two or more data types [17].

Handling combinations of data types is a very difficult process that resembles dealing with heterogeneous databases. For example, each database in such an environment may contain data belonging to multiple data types. These databases can be integrated and then mined, or mining tools can be applied to the individual databases and the results of the various data miners combined.

In both cases the Multimedia Distributed Processor (MDP) has an important role. If the data is integrated before it is mined, the integration is carried out by the MDP. If the data is mined first, the data miner augments the multimedia database management system, and the results of the data miners are integrated by the MDP.

Because there is still a lot to be researched about mining the individual multimedia data types [18] (text, images, audio and video), there is even more work to be done on mining combinations of multimedia data types. First, the individual data types have to be handled well, and only later can they be mined together.


Here I have focused on four multimedia data types: text, images, audio and video. I have defined what data mining means for such data types and discussed developments and challenges. Finally, I discussed the issues that arise when mining different multimedia data types together.

I have mainly addressed the mining of individual data types and not their combination. In most cases, however, multimedia consists of a combination of two or more media types. As mining of individual multimedia data types develops, it is expected that mining of combinations of data types will also be achieved.

There are two requirements for multimedia mining to become complete:

Mining techniques that model order as part of the data - If the sequence in multimedia is ignored, too much data is lost. Neither of the two approaches to mining ordered data, time series and event sequences, is sufficient for multimedia. It is good to capture order in the result; for example, discovered patterns can include "first pattern before the second one".

Comparing objects that are represented differently - Pictures taken from different angles, or a photograph and a drawing, capture similar information. However, differing representations lead data mining algorithms to overvalue the differences between objects. When algorithms recognize similarities in the data for two objects, the objects may be the same. However, different data do not mean that the objects are different; differences in how the data are captured may cause the same object to be represented by very different sets of data. Mining techniques must handle this issue.
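The first requirement, capturing order in the discovered patterns, can be illustrated with a minimal sketch that measures how often one event occurs before another across a set of event sequences. The event names, the sequences and the function before_support are invented for this example, and a real multimedia miner would need far richer patterns than ordered pairs.

```python
def before_support(sequences, first, second):
    """Fraction of sequences in which `first` occurs somewhere before `second`."""
    hits = 0
    for seq in sequences:
        if first in seq:
            # Using the earliest occurrence of `first` is enough to decide
            # whether any later `second` exists.
            start = seq.index(first)
            if second in seq[start + 1:]:
                hits += 1
    return hits / len(sequences)


# Invented event sequences, e.g. detected events in four sports clips.
CLIPS = [
    ["goal", "replay", "crowd"],
    ["kickoff", "goal", "crowd", "replay"],
    ["replay", "goal"],
    ["crowd", "kickoff"],
]

goal_then_replay = before_support(CLIPS, "goal", "replay")
replay_then_goal = before_support(CLIPS, "replay", "goal")
print(goal_then_replay, replay_then_goal)
```

A miner that ignored order would treat the first three clips as equivalent, since each contains both events, whereas the ordered pattern separates the third clip from the first two.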

If progress is made on multimedia data mining, a lot of tools for mining multimedia data will emerge. Today's data mining tools work only on relational databases, but it is expected that multimedia data mining tools, as well as tools for mining object databases, will be developed in the future. A lot of research is required for this to be achieved.