Methods Of Extracting Information From Unstructured Data Computer Science Essay



In this paper I present different methods for mining information from unstructured multimedia content. Today, most users rely on information available on the web for a variety of purposes. Because most of this information is available only as HTML documents, pictures, sound and video, many techniques have been defined that allow information from the web to be extracted automatically. Analyzing these multimedia files can reveal a great deal of useful information for users. Extracting data from the Internet is more than an extension of data mining: it is an effort that draws on computer vision, data retrieval, multimedia mining, artificial intelligence, XML and databases.


The promise of the Semantic Web is to "bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users". Information which is currently prepared only for humans to read will be richly labeled, classified, and indexed to allow intelligent agents to schedule our appointments, perform more accurate searches, and generally interact more effectively with the sea of data on the Web.

These advances, however, rely on the accurate semantic labeling of data that currently exists only in human-readable format on the World Wide Web. Normally, this labeling would be a task for content providers, as they have easy access to the relational data which makes up the pages, as well as the ability to alter the existing published content. Several tools, such as browsers and distributed search engines for ontology, have been developed explicitly with the goal of making it easier for content providers to add semantic markup to their existing World Wide Web pages.

Unfortunately, providers often have little or no incentive to mark up their existing documents. In this work, then, we take a different approach. Our goal is to provide a tool which allows end-users, rather than content providers, to author and utilize their own semantic labels for existing content. In particular, we aim to make the extraction of semantic content accessible to non-technical users, modifying existing user interfaces to provide them with the ability to label web pages with semantic meaning. By giving these users control over semantic content, we hope to reduce the reliance of the Semantic Web on content providers and speed its adoption.

The patterns are created by a powerful algorithm which takes advantage of the inherent hierarchical structure of HTML. We utilize the technique of tree edit distance to find the best mapping between the given examples. This mapping allows us to highlight and extract only the structural elements that the examples have in common, discarding any instance-specific content. What is left is a generic pattern, capable of recognizing other instances of the same type.
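
As a minimal sketch of the idea, the code below computes a simplified top-down tree edit distance between two small HTML-like trees. The tuple representation of a node, the unit costs and the function names are assumptions made for illustration; the algorithm used in the actual system may differ.

```python
from functools import lru_cache

# Assumed representation: a node is (label, children), where children is a
# tuple of nodes. Tuples are hashable, so results can be cached.

def size(tree):
    """Number of nodes in the tree (cost of inserting or deleting it whole)."""
    _, children = tree
    return 1 + sum(size(child) for child in children)

@lru_cache(maxsize=None)
def tree_distance(a, b):
    """Simplified top-down edit distance between two ordered trees.

    Relabeling a node costs 1; inserting or deleting a subtree costs its size.
    The children of the two roots are aligned with a sequence edit distance.
    """
    (label_a, kids_a), (label_b, kids_b) = a, b
    relabel = 0 if label_a == label_b else 1
    m, n = len(kids_a), len(kids_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),      # delete a subtree
                d[i][j - 1] + size(kids_b[j - 1]),      # insert a subtree
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + d[m][n]

# Two example list items: the second has one extra <i> element.
first = ("li", (("b", ()), ("a", ())))
second = ("li", (("b", ()), ("i", ()), ("a", ())))
print(tree_distance(first, second))  # 1
```

The nodes that are matched at zero cost (here li, b and a) are the shared structure that a generic pattern would keep, while the extra i element is instance-specific.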

Once a wrapper is created, the user may then give it semantic meaning by overlaying it with statements about the classes and properties it represents. These descriptions are created through a simple user interface, but trigger statements in RDF, the language which is the framework for the Semantic Web. By drawing these classes and properties from an existing ontology appropriate for the page in question, the user gives the wrapper a general meaning compatible with other sites of the same type.

Wrappers also provide a powerful means for importing, exporting, and manipulating the unstructured data on the Web. Once wrapped, the information is in a structured, relational format, RDF, which can be easily managed and queried. Disparate sources of similar data can be easily brought together. For instance, wrappers created on several news web sites can be integrated into a single RSS feed. Alternatively, wrappers of the same semantic type allow us to reformat and integrate data. A user could integrate all their news sites into a single page, formatted in whatever way is best for that user.

By creating wrappers, users are, in effect, creating a bridge between the syntactic structure and the semantic structure of the web page. In general, this parallel structure has always existed, abstractly, in the intentions of the page's creator and in the interpretations of the page's reader. In our system, however, the act of building a wrapper for this content makes the connection explicit on the user side. It is from this syntactic-semantic bridge that our wrappers get their power.

Information Extraction

Our approach to pattern induction and matching is one case of the larger task of information extraction. This field, especially in relation to documents on the World Wide Web, has received much attention in recent years. Information extraction covers the automated retrieval of data from both structured and unstructured documents. Various approaches, using both supervised and unsupervised learning, have been tried, with varying degrees of success.

The subfield of information extraction dealing with documents on the World Wide Web is called wrapper induction, defined by Kushmerick as the task of learning a procedure for extracting tuples from a particular information source from examples provided by the user. Kushmerick defined the HLRT class of wrappers, implemented in the WIEN (Wrapper Induction ENvironment) system. These wrappers were restricted to locating information which is delimited by four types of flags, the "head," "left," "right," and "tail." Because of this limitation, this class was found to successfully wrap only 48% of relational data in HTML documents in a 1997 Web survey.
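
To make the four kinds of delimiters concrete, here is a small sketch of how an HLRT-style wrapper could be executed once its delimiters are known. It is not the WIEN implementation; the function name, the sample page and the delimiter strings are assumptions chosen for illustration, and learning the delimiters from examples is not shown.

```python
def hlrt_extract(page, head, tail, delimiters):
    """Extract tuples from page text with an HLRT-style wrapper.

    head and tail mark where the relevant region starts and ends;
    delimiters is a list of (left, right) pairs, one per attribute.
    """
    start = page.find(head)
    if start == -1:
        return []
    pos = start + len(head)
    end = page.find(tail, pos)
    if end == -1:
        end = len(page)

    rows = []
    first_left = delimiters[0][0]
    while True:
        nxt = page.find(first_left, pos)
        if nxt == -1 or nxt >= end:
            break  # no further tuple before the tail delimiter
        row = []
        for left, right in delimiters:
            lo = page.find(left, pos)
            if lo == -1:
                return rows
            lo += len(left)
            hi = page.find(right, lo)
            if hi == -1:
                return rows
            row.append(page[lo:hi])
            pos = hi + len(right)
        rows.append(tuple(row))
    return rows

page = ("<html><p>Codes</p><b>Congo</b> <i>242</i><br>"
        "<b>Egypt</b> <i>20</i><br><hr><b>End</b></html>")
rows = hlrt_extract(page, "<p>", "<hr>", [("<b>", "</b>"), ("<i>", "</i>")])
print(rows)  # [('Congo', '242'), ('Egypt', '20')]
```

The bold "End" after the tail delimiter is not extracted, which is exactly what the head and tail flags are for. Pages whose data is not cleanly delimited in this way fall outside the class.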

A related approach, the STALKER system, attempts to capture some of the hierarchical structure of HTML and its semantic data. The embedded catalog (EC) formalism consists of lists of k-tuples, with each element of the k-tuple being either information of relevance to the user or another k-tuple. The EC description of a page is therefore a hierarchy similar to the subject-predicate-object description used by RDF for the Semantic Web. The EC of a page allows the STALKER system greater flexibility with fewer examples in locating information in a page than HLRT wrappers. However, despite the hierarchical nature of the EC, STALKER's matching algorithm still treats the underlying HTML source of pages as a linear string, ignoring its hierarchical structure.

Several other approaches to information extraction utilize probabilistic models. Hidden Markov Models offer the opportunity to learn not only the probabilities but the state structure to represent information in various types of documents. These approaches also treat the document as a linear set of objects, first parsing it into tokens such as HTML tags and text. The state structure of the HMM is either hand crafted, or is learned from the set of training examples using stochastic optimization. These models have been quite successful at extracting information from semi-structured documents, such as academic papers, but their usefulness on HTML documents is unproven.

Other models learn to classify data by assuming that "nearby" elements in a hierarchy should be classified similarly. For instance, the document tree on a file-system or web-site is useful for inferring which documents are related. If most documents in a certain directory sub-tree have been classified in a certain way, new documents appearing in nearby sub-trees are more likely to be similarly classified.
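
A very simple version of this idea can be sketched as a majority vote among already classified documents in the nearest enclosing directory. The function name and the sample paths are hypothetical, and real models of this kind are probabilistic rather than a plain vote.

```python
from collections import Counter
from pathlib import PurePosixPath

def classify_by_neighbors(path, labeled):
    """Label a document by majority vote in the nearest labeled sub-tree.

    labeled maps document paths to class labels. Ancestor directories of
    path are tried from the closest one outward.
    """
    for ancestor in PurePosixPath(path).parents:
        votes = Counter(
            label
            for doc, label in labeled.items()
            if ancestor in PurePosixPath(doc).parents
        )
        if votes:
            return votes.most_common(1)[0][0]
    return None

labeled = {
    "/docs/sports/football/a.html": "sports",
    "/docs/sports/tennis/b.html": "sports",
    "/docs/finance/c.html": "finance",
}
print(classify_by_neighbors("/docs/sports/golf/new.html", labeled))  # sports
```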

Probabilistic Context Free Grammars, or PCFGs, are a statistical technique for semantically tagging semi-structured data. When used to extract semantic content from English sentences, a probabilistic model is learned by parsing tagged training examples and noting the frequency of occurrences of certain context-free rules among the training data. This model may then be used to tag new sentences by finding the most likely set of rules which could have created that sentence in the model. With the addition of "semantic" tags to the training examples, PCFGs may be used to label phrases with semantic meaning, the first step in higher-level information extraction. For example, PCFGs have been successfully used to extract the post, company, person entering and person leaving the post from newspaper texts describing corporate management successions.
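
The counting step described above can be sketched as follows. The nested-tuple tree format, the function names and the two toy parse trees are assumptions for illustration; a real system would also need a parser that uses the estimated probabilities to tag new sentences.

```python
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) for every internal node of a parse tree.

    Assumed format: (label, child, child, ...), where a child is either
    another tree or a plain string (a word).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    counts = Counter(rule for tree in treebank for rule in rules(tree))
    totals = Counter()
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}

treebank = [
    ("S", ("NP", "Smith"), ("VP", ("V", "resigned"))),
    ("S", ("NP", "Jones"), ("VP", ("V", "joined"), ("NP", "Acme"))),
]
probs = estimate_pcfg(treebank)
print(probs[("S", ("NP", "VP"))])  # 1.0
print(probs[("VP", ("V",))])       # 0.5
```

Adding semantic labels such as a person or company tag to the node labels in the training trees would let the same counting procedure learn a grammar that also carries semantic information.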

One example of an interactive system for learning patterns on various types of documents is LAPIS. This system provides an interactive interface where users may specify examples relevant to a pattern by highlighting them. Patterns are constructed using a language called text constraints, which includes operators such as before, after, contains, and starts-with. By using a pre-defined library of parsers which tokenize and label the document, users can create patterns of arbitrary complexity, or allow the system to infer them from examples. This inference is performed by constructing a dictionary of region sets, which describe areas of the document which match certain tokens from the parsers. By analyzing the overlaps and repetitions of these region sets, LAPIS extracts its structured text constraints patterns. The results of matching these patterns are then displayed for the user, allowing them to perform such tasks as simultaneous editing and outlier finding. While it currently has no ties to the Semantic Web, LAPIS is a powerful pattern induction system, and our system will take advantage of its parsing abilities in cases where our tree edit distance algorithm is not applicable.

III.1 Text Mining

Most information is in text form, whether it is data on the web, library data or electronic books. The biggest problem with text data is that it is not structured the way relational data is. In some cases, text is structured or semi-structured. Semi-structured data can be, for example, an article with structured fields such as title, author and abstract, followed by unstructured paragraphs.

Information retrieval and text processing systems have been with us for a long time. Some of them are very sophisticated and can find documents from specified attributes or keywords. There are also text processing systems that can find associations between documents.

Text mining is defined as mining text data. It is all about extracting patterns and previously unknown associations from text databases. The difference between text mining and information retrieval is the same as the difference between data mining and query processing. Query processing and information retrieval look for specific data items, whereas mining looks for higher-level concepts that span many items. The newest information retrieval and text processing tools find associations between words and paragraphs, so they can be seen as text mining tools.

Now we will examine approaches to text mining. Current tools and techniques for data mining work on relational databases. Even for data in object-oriented databases, we rarely hear about data mining tools. So, current mining tools cannot be applied directly to text data. The current directions in mining unstructured data include the following:

* Extract data and metadata from unstructured databases by using tagging techniques, store that data in structured databases, and apply data mining tools to the structured databases.

* Integrate data mining techniques with information retrieval tools so that appropriate data mining tools can be developed for unstructured databases.

* Develop data mining tools that operate directly on unstructured databases.

Figure 3-1: Converting unstructured data to structured data

When text data is converted to a relational database, care must be taken not to lose critical information. As said before, when the data is not good, the mining process will not be efficient and it won't yield useful results. First, a warehouse must be created before the converted database is mined. This is essentially a relational database that holds the essential data from the text. This means that a transformer is required that takes a text corpus as input and outputs tables, for example of the keywords from the text.

In a text database with several articles, it is possible to create a warehouse with tables that contain the following attributes: author, date, publisher, title, and keywords. The keywords can differ from article to article, and the job of the data miner is to find associations between them. Extending this to the extraction of keywords and concepts is shown in Figure 3-1.
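
The warehouse idea can be sketched with an in-memory SQLite database. Everything specific here is an assumption for illustration: the table layout, the crude keyword rule (non-stopwords that occur at least twice), the function names and the sample articles are made up, not taken from an existing system.

```python
import re
import sqlite3
from collections import Counter

STOPWORDS = {"a", "and", "for", "from", "in", "is", "of", "on", "the", "to"}

def keywords(text, min_count=2):
    """Hypothetical transformer: non-stopwords occurring at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return sorted(w for w, n in counts.items() if n >= min_count)

def build_warehouse(articles):
    """Load article attributes and extracted keywords into relational tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, author TEXT, "
               "date TEXT, publisher TEXT, title TEXT)")
    db.execute("CREATE TABLE keyword (article_id INTEGER, word TEXT)")
    for art in articles:
        cur = db.execute(
            "INSERT INTO article (author, date, publisher, title) "
            "VALUES (?, ?, ?, ?)",
            (art["author"], art["date"], art["publisher"], art["title"]))
        db.executemany("INSERT INTO keyword VALUES (?, ?)",
                       [(cur.lastrowid, w) for w in keywords(art["body"])])
    return db

def cooccurring_pairs(db, min_support=2):
    """Keyword pairs that appear together in at least min_support articles."""
    return db.execute(
        "SELECT a.word, b.word, COUNT(*) AS n "
        "FROM keyword a JOIN keyword b "
        "  ON a.article_id = b.article_id AND a.word < b.word "
        "GROUP BY a.word, b.word HAVING COUNT(*) >= ? "
        "ORDER BY n DESC, a.word, b.word",
        (min_support,)).fetchall()

articles = [  # made-up sample data
    {"author": "A. Author", "date": "2001-01-01", "publisher": "P1",
     "title": "Images", "body": "Image mining finds patterns. Image mining uses patterns."},
    {"author": "B. Author", "date": "2002-02-02", "publisher": "P2",
     "title": "Text", "body": "Text mining finds patterns in text. Mining patterns is hard."},
]
db = build_warehouse(articles)
print(cooccurring_pairs(db))  # [('mining', 'patterns', 2)]
```

Once the text is in tables, the association step is an ordinary relational query, which is the point of converting to a warehouse first. The price is that everything the transformer did not extract is lost to the miner.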

A great deal of effort has gone into augmenting information retrieval systems so that they can perform text mining. This is a product of attempts to improve information retrieval. Many companies have built products that identify frequent concepts in documents as a means of organizing documents and improving information retrieval. This information is very useful in its own right, and not simply as an aid to information retrieval.

Another approach, mining text directly, has been used on problems of text classification and text clustering. There are several examples, and groups have competed in solving text mining problems centered on a corpus in which documents have been classified into topics. Although the direct approach has proven effective for classification and clustering, some attempts to obtain other types of data mining results directly from unstructured data have had no success. Approaches that treat documents as sets of words or phrases lose too much information and produce many meaningless results. Mining techniques must be developed that properly model the ordered flow of concepts in text.
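
A small example shows how much a set-of-words view can lose. The two sentences below are invented for illustration; they say opposite things, yet their word counts are identical, while even a minimal model of order (adjacent word pairs) tells them apart.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts with order discarded."""
    return Counter(text.lower().split())

def bigrams(text):
    """Counts of adjacent word pairs, which keep a little of the order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

a = "the company acquired the startup"
b = "the startup acquired the company"

print(bag_of_words(a) == bag_of_words(b))  # True
print(bigrams(a) == bigrams(b))            # False
```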

III.2 Image Mining

Text mining is in its early stages, and image mining is even less mature. Now I will examine the challenges of this technique. Image processing is widely used in many applications, such as medical imaging for detecting cancer, satellite image processing for space applications, and hyper-spectral imaging. Images include many entities such as maps, geological structures, biological structures and others. Image processing deals with areas like detecting abnormal patterns that deviate from the norm, retrieving images by content, and pattern matching.

If image processing focuses on detecting abnormal patterns and retrieving images, then image mining is all about finding unusual patterns. Image mining deals with making associations between different images from image databases.

The first attempt at image mining was in 1977, and the plan was to extract metadata from images and carry out mining on the metadata. This was essentially mining metadata in relational databases, but later it was discovered that images can be mined directly. In this case the challenge is to determine what type of mining outcome is most suitable: whether to mine for associations between images, cluster images, classify images, detect unusual patterns, or mine a sequence of images to find out whether there are any unusual changes. However, the mining tools don't tell why the changes are unusual.

Detecting unusual patterns is not the only outcome of image mining; attempts have also been made to identify recurring themes in images, both at the level of raw images and with higher-level concepts.

Still, this is just the beginning. More research on image mining is required to check whether data mining techniques can be used to classify, cluster and associate images. Image mining is a topic with applications in numerous domains, including space, medical and geological images.

III.3 Audio Mining

Audio and video are continuous media types, so the techniques for audio and video information processing and mining are similar. Audio can take different forms, such as radio, speech or spoken language. TV news also has audio, which is integrated with video and possibly with text for titles or other information.

To mine audio data, it can be converted into text with speech recognition software and other techniques, such as keyword extraction, and the resulting text data can then be mined as shown in Figure 3-2. Audio data can also be mined directly by using audio information processing techniques and then mining selected audio data.

Figure 3-2: Mining text extracted from audio

Generally audio mining is simpler than video mining, but not much research has been done on this topic yet.

III.4 Video Mining

Video mining is even more complicated than image, audio or text mining. We can see video as a collection of moving images. It is the subject of a lot of research. Important areas are developing query and retrieval techniques for video databases, including video indexing, query languages, and optimization strategies. Unlike image or text mining, there is no clear picture of what video mining is. Video clips can be examined to find what they have in common, or to find unusual patterns in them. The first step toward successful video mining is handling image mining well.

Figure 3-3: Image mining

Now I will turn to pattern matching in video databases. To be consistent in terminology, video mining means finding correlations and previously unknown patterns in video databases. When one or more video clips are analyzed, some unusual behavior may be identified.

If an object occurs many times in a video, this suggests that it is significant. By capturing the text associated with the video and making associations, it is possible to mine the text, but this time using the video data.

There is not much information on analyzing video data. When text summaries of the video are used, the task amounts to mining text, as shown in Figure 3-4. Converting the video mining problem into a text mining problem is reasonably well understood. The real challenge is to mine video data directly and to know what to mine. Direct video mining is becoming very important with the emergence of the web.

Figure 3-4: Mining text extracted from video

Work has been done on categorizing video based on characteristics of the video itself, rather than the associated text. It is possible to identify features based on cinematic principles (length of scenes, scene changes, etc.) and use them as classification input. Different approaches to summarization have been suggested, and they involve key frames or scene categorization.
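
As a rough sketch of what such features might look like, the function below turns a list of scene-change timestamps into a few numbers that a classifier could take as input. Detecting the scene changes themselves is assumed to have been done already; the function name, the feature names and the sample values are hypothetical.

```python
def cinematic_features(cut_times, duration):
    """Simple features from scene-change timestamps (in seconds).

    cut_times are the moments where one scene ends and the next begins;
    duration is the total length of the clip.
    """
    bounds = [0.0] + sorted(cut_times) + [float(duration)]
    scenes = [end - start for start, end in zip(bounds, bounds[1:])]
    return {
        "num_scenes": len(scenes),
        "mean_scene_length": sum(scenes) / len(scenes),
        "longest_scene": max(scenes),
        "cuts_per_minute": len(cut_times) / (duration / 60.0),
    }

# A 60-second clip with scene changes at 10 s, 20 s and 30 s.
print(cinematic_features([10, 20, 30], 60))
# {'num_scenes': 4, 'mean_scene_length': 15.0, 'longest_scene': 30.0, 'cuts_per_minute': 3.0}
```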

As with text mining, a lot of the work in video mining has been done on video retrieval. This identifies video segments that users can watch, and it is driven by the user watching that video. Further, it is necessary to understand what types of knowledge need to be gained from video mining. In some cases this can be straightforward, but a wide set of applications for video mining must be identified before research in this area can take the next step forward.

III.5 Mining different data types

Previously I have described the mining process for individual data types: text, images, audio and video. But mining multimedia requires mining combinations of two or more data types.

Handling combinations of data types is a very difficult process that resembles dealing with different databases. For example, each database environment may contain data belonging to multiple data types. These databases can be integrated and then mined, or mining tools can be applied to the individual databases and the results of the various data miners combined. This is illustrated in Figures 3-5 and 3-6.

Figure 3-5: Mining and then integrating

In both cases the Multimedia Distributed Processor (MDP) has an important role. If the data is integrated before it is mined, then the integration is carried out by the MDP. If the data is mined first, the data miner augments the multimedia database management system, and the results of the data miners are integrated by the MDP.

Figure 3-6: Integrating and then mining
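
The two orderings can give different results, which a toy example makes visible. The sketch below uses a deliberately simple miner (items that occur in at least a minimum number of records) and does not model the Multimedia Distributed Processor; the function names and the two small databases are assumptions for illustration.

```python
from collections import Counter

def mine(records, min_support):
    """Toy miner: items that occur in at least min_support records."""
    counts = Counter(item for rec in records for item in set(rec))
    return {item: n for item, n in counts.items() if n >= min_support}

def mine_then_integrate(databases, min_support):
    """Mine each database separately, then combine the miners' results."""
    combined = Counter()
    for db in databases:
        combined.update(mine(db, min_support))
    return dict(combined)

def integrate_then_mine(databases, min_support):
    """Merge all records first, then run a single miner over them."""
    merged = [rec for db in databases for rec in db]
    return mine(merged, min_support)

text_db = [["storm", "flood"], ["storm"]]
video_db = [["flood", "rescue"], ["storm", "rescue"]]

print(mine_then_integrate([text_db, video_db], 2))  # {'storm': 2, 'rescue': 2}
print(integrate_then_mine([text_db, video_db], 2))  # {'storm': 3, 'flood': 2, 'rescue': 2}
```

In this example "flood" is frequent only when the two sources are considered together, so mining first and integrating afterwards misses it. Integrating first finds it, but requires the different data types to be brought into a common form before mining.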

Because there is a lot still to be researched about mining individual multimedia data types (text, images, audio and video), there is even more work to be done on mining combinations of multimedia data types. First it is necessary to handle the individual data types well, and only later to mine them together.


In this seminar paper I started with a definition of multimedia database management systems and then gave an overview of these systems. I described in detail the different types of architectures, data models and functions of these systems, and then addressed data mining for multimedia data. Here I have focused on four multimedia data types: text, images, audio and video. I have defined what data mining means for such data types and discussed the developments and challenges. At the end, I discussed the issues that arise when mining different multimedia data types.

Mainly, I have addressed mining of individual data types and not of combinations. But in most cases multimedia consists of a combination of two or more media types. As mining of individual multimedia data types develops, it is expected that mining of combinations of data types will also be achieved.

There are two requirements for multimedia mining to become complete:

1. Mining techniques that model order as part of the data: If the sequence in multimedia is ignored, too much information is lost. Neither of the existing approaches to mining ordered data, time series and event sequences, is sufficient for multimedia. It is also good to capture order in the result; for example, discovered patterns can include "first pattern before the second one".

2. Comparing objects that are represented differently: Pictures taken from different angles, or a photograph and a drawing, capture similar information. But differing representations lead data mining algorithms to overvalue the differences between some objects. When algorithms recognize similarities in the data for two objects, the objects may be the same. However, different data do not mean that the objects are different; differences in how the data are captured may cause the same object to be represented by very different sets of data. Mining techniques must handle this issue.
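
For the first requirement, a pattern of the form "first before second" can be sketched as a count over event sequences. This is only an illustration of what an order-aware result looks like, not a proposed multimedia mining technique; the function name and the event sequences are made up.

```python
from collections import Counter
from itertools import combinations

def before_patterns(sequences, min_support):
    """Find (x, y) pairs where x occurs before y in at least min_support sequences."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        # combinations over indices yields i < j, i.e. seq[i] comes first
        for i, j in combinations(range(len(seq)), 2):
            if seq[i] != seq[j]:
                seen.add((seq[i], seq[j]))
        counts.update(seen)  # each sequence counts a pair at most once
    return {pair: n for pair, n in counts.items() if n >= min_support}

clips = [
    ["siren", "crash", "crowd"],
    ["siren", "crowd", "crash"],
    ["crash", "siren", "crowd"],
]
print(before_patterns(clips, 3))  # {('siren', 'crowd'): 3}
```

All three clips contain the same three events, so a technique that ignores order cannot distinguish them, while the ordered pattern "siren before crowd" holds in every clip.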

If progress is made on multimedia data management and data mining, a lot of tools for mining multimedia data will emerge. Today's data mining tools work only on relational databases, but it is expected that in the future multimedia data mining tools will be developed, as well as tools for mining object databases. A lot of research is required for this to be achieved.