Data mining and semantic web


Data mining and the Semantic Web are two different avenues leading to the same goal: efficient retrieval of knowledge from large compact or distributed databases, or from the Internet. Knowledge in this context means the synergistic interaction of information (data) and its relationships (correlations). The major difference between the two approaches is where the complexity is placed.

These two approaches have their own advantages and disadvantages, and they can be integrated with each other to reduce the drawbacks of both.

This paper explores the integration of the Semantic Web and data mining in different fields of application.

1 Introduction

Information resources on the Web are mostly natural language text documents suited to human consumption. The problem with this kind of Web is that its content is not machine-accessible. Search engines try to establish connections between documents, but their use is associated with serious problems such as "high recall, low precision", "low or no recall" and "results are highly sensitive to vocabulary".

Ongoing efforts to overcome these problems, and to structure Web content so that it can be queried and information can be retrieved from it, follow two directions.

One solution is to use the content as it is and to apply information retrieval and text mining techniques that categorize textual resources. These approaches either (i) predefine a metric on a document space in order to cluster 'nearby' documents into meaningful groups (called 'unsupervised categorization' or 'text clustering'), or (ii) adapt a metric on a document space to a manually predefined sample of documents assigned to a list of target categories, so that new documents can also be assigned labels from that list ('supervised categorization' or 'text classification') [1]. In this approach data and knowledge are represented with simple mechanisms, typically HTML, and without metadata, so relatively complex data mining algorithms such as decision trees and rule induction have to be used. The advantage is that document categorization of this kind is relatively cheap. The drawback is that the quality of the categorization for larger sets of target categories, as well as the understandability of its results, is often quite low.
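The supervised case (ii) can be sketched in a few lines. This is only an illustration, not the method of [1]: it assumes whitespace tokenization, cosine similarity as the metric, and a nearest-centroid rule, and the training sample and category names are invented.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-tokenised term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_docs):
    """Sum the term counts of all sample documents per category."""
    centroids = {}
    for text, label in labelled_docs:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))
    return centroids


def classify(text, centroids):
    """Assign the category whose centroid is most similar to the document."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training sample, invented for illustration.
sample = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sport"),
    ("player scores late goal", "sport"),
]
centroids = train_centroids(sample)
label = classify("interest rates and shares", centroids)
```

Here `label` is the category assigned to a new, unlabelled document. Nothing in the code knows what the words mean, which is one reason why results of this kind can be hard to interpret.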

An alternative approach is to represent Web content in a form that is more easily machine-processable, using the Semantic Web. Data and knowledge are represented with more complex mechanisms, typically XML, and with plenty of metadata. In this approach conceptual structures such as thesauri and ontologies are constructed. The advantage is that the quality of manual metadata can be very good and that relatively simple, low-complexity algorithms suffice at retrieval time. The drawback is that building an ontology and adding manual metadata typically costs one or several orders of magnitude more than automatic approaches, and that designing and maintaining the metadata adds considerable complexity at system design time [2].
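To illustrate why retrieval becomes simple once metadata exists, the following sketch assumes that the metadata is available as subject-predicate-object triples in the style of RDF. The document identifiers, predicates and the `match` helper are hypothetical.

```python
# Hypothetical metadata, written as (subject, predicate, object) triples.
triples = [
    ("doc1", "hasTopic", "Diabetes"),
    ("doc2", "hasTopic", "Asthma"),
    ("doc3", "hasTopic", "Diabetes"),
    ("doc1", "hasAuthor", "Smith"),
]


def match(pattern, store):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Retrieval is a plain filter: no text analysis is needed at query time.
diabetes_docs = [s for s, _, _ in match((None, "hasTopic", "Diabetes"), triples)]
```

The cost has moved from query time to design time: someone had to decide on the predicates and annotate every document.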

The first approach can be called the data mining approach and the second the Semantic Web approach. The two can be integrated with each other to reduce the drawbacks of both.

This paper presents some uses of the Semantic Web and data mining. Section 2 introduces an ontology-based framework for text mining. Section 3 discusses applications of the Semantic Web and data mining in healthcare. Section 4 concludes the paper.

2 An ontology-based Framework for Text Mining

This framework was constructed by S. Bloehdorn, P. Cimiano, A. Hotho and S. Staab [1]. It uses text mining to learn a target ontology from text documents and then uses the same ontology to improve the effectiveness of both supervised and unsupervised text categorization approaches.

The architecture builds upon the Karlsruhe Ontology and Semantic Web Infrastructure (KAON), a general, multi-functional open source ontology management infrastructure and tool suite developed at Karlsruhe University. The framework gives definitions of a core ontology, of sub-concepts and super-concepts, of the domain and range of relations, of a lexicon for an ontology, and of a knowledge base. The main component responsible for creating and maintaining ontologies is "TextToOnto". It employs text mining techniques such as term clustering and matching of lexico-syntactic patterns, as well as general-purpose resources such as WordNet [1]. It has three main components. The Ontology Management Component provides basic ontology management functions such as editing, browsing and evolution of ontologies. The Algorithm Library Component incorporates a number of text mining methods. The Coordination Component is used to interact with the different ontology learning algorithms from the algorithm library.
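The ingredients named in these definitions (concepts, a sub-/super-concept hierarchy, relations with a domain and a range, and a lexicon) can be pictured as a small data structure. The sketch below is an assumption made for illustration only; it is not the KAON or TextToOnto API, and the class, method and concept names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set = field(default_factory=set)
    # Concept hierarchy: sub-concept -> set of direct super-concepts.
    parents: dict = field(default_factory=dict)
    # Relation name -> (domain concept, range concept).
    relations: dict = field(default_factory=dict)
    # Lexicon: lexical entry (a term) -> set of concepts it may refer to.
    lexicon: dict = field(default_factory=dict)

    def add_concept(self, name, parent=None, terms=()):
        self.concepts.add(name)
        if parent is not None:
            self.concepts.add(parent)
            self.parents.setdefault(name, set()).add(parent)
        for term in terms:
            self.lexicon.setdefault(term.lower(), set()).add(name)

    def super_concepts(self, name):
        """All direct and indirect super-concepts of a concept."""
        found, stack = set(), list(self.parents.get(name, ()))
        while stack:
            c = stack.pop()
            if c not in found:
                found.add(c)
                stack.extend(self.parents.get(c, ()))
        return found


onto = Ontology()
onto.add_concept("Food", terms=["food"])
onto.add_concept("Animal", terms=["animal"])
onto.add_concept("Mammal", parent="Animal", terms=["mammal"])
onto.add_concept("Dog", parent="Mammal", terms=["dog", "hound"])
onto.relations["eats"] = ("Animal", "Food")
```

The lexicon is what connects an ontology to text: it records which words may refer to which concept, so that "hound" and "dog" both lead to `Dog`.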

2.1 Ontology-based Text Clustering and Classification

The demand for systems that automatically classify text documents into predefined thematic classes, or detect clusters of documents with similar content, is urgent because of the ever-growing amount of textual information available electronically. Existing text categorization systems have typically used the Bag-of-Words model from information retrieval, in which single words or word stems are used as features for representing document content. In this paradigm documents are represented as bags of terms. The absolute frequency of term $t$ in document $d$ is given by $\mathrm{tf}(d, t)$, and term vectors are denoted $\vec{t}_d = (\mathrm{tf}(d, t_1), \ldots, \mathrm{tf}(d, t_m))$.
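A minimal sketch of the term vectors just defined, assuming whitespace tokenization and no stemming; the two example documents are invented.

```python
from collections import Counter


def term_vectors(docs):
    """Return the sorted vocabulary and one tf vector per document."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    # Entry j of a document's vector is tf(d, t_j).
    return vocab, [[c[t] for t in vocab] for c in counts]


docs = ["the cat sat", "the dog chased the cat"]
vocab, vectors = term_vectors(docs)
```

Each vector has one entry per vocabulary term, so word order inside a document is lost; only the counts remain.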

To exploit the background knowledge about concepts that the ontology model provides, the term vectors are extended by new entries for the ontological concepts c that appear in the document set.
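One way to picture this extension is shown below. It assumes a hypothetical lexicon that maps each term to at most one concept, and it counts a concept once for every occurrence of a term that refers to it; the framework in [1] is more elaborate than this.

```python
from collections import Counter

# Hypothetical lexicon: term -> ontology concept it refers to.
LEXICON = {"dog": "Dog", "hound": "Dog", "cat": "Cat"}


def extend_with_concepts(term_counts, lexicon):
    """Append concept counts to a document's term counts."""
    concept_counts = Counter()
    for term, freq in term_counts.items():
        concept = lexicon.get(term)
        if concept is not None:
            concept_counts[concept] += freq
    # Prefix concept entries so they cannot collide with terms.
    extended = Counter(term_counts)
    for concept, freq in concept_counts.items():
        extended["concept:" + concept] = freq
    return extended


doc = Counter("the hound chased the dog".split())
extended = extend_with_concepts(doc, LEXICON)
```

Two documents that use "hound" and "dog" respectively share no term, but after the extension they share the entry for the concept `Dog`.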

The process of extracting concepts from texts has five steps:

  1. Candidate term detection: an algorithm that maps multi-word expressions to the most appropriate concept.
  2. Syntactical patterns: uses the part-of-speech tags of the words.
  3. Morphological transformations.
  4. Word sense disambiguation.
  5. Generalization: going from the specific concepts found in the text to more general concept representations.
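The final generalization step can be sketched as follows. The hierarchy, the single-parent assumption and the `max_levels` parameter are hypothetical choices made for this example, not details taken from [1].

```python
from collections import Counter

# Hypothetical concept hierarchy: concept -> direct super-concept.
PARENT = {"Dog": "Mammal", "Cat": "Mammal", "Mammal": "Animal"}


def generalize(concept_counts, parent, max_levels):
    """Add each concept's count to its super-concepts, up to max_levels up."""
    result = Counter(concept_counts)
    for concept, freq in concept_counts.items():
        current = concept
        for _ in range(max_levels):
            current = parent.get(current)
            if current is None:
                break
            result[current] += freq
    return result


counts = Counter({"Dog": 2, "Cat": 1})
one_level = generalize(counts, PARENT, max_levels=1)
two_levels = generalize(counts, PARENT, max_levels=2)
```

With one level, a document about dogs and a document about cats both gain weight on `Mammal`, which makes them look more similar to a clustering or classification algorithm.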

3 Semantic Web and data mining in Healthcare

This section discusses the use of the Semantic Web and data mining in healthcare. Section 3.1 gives an overview, Section 3.2 discusses using semantic dependencies to mine depressive symptoms from consultation records, and Section 3.3 discusses the requirements for ontologies in medical data integration.

3.1 Overview

The Web has become a major vehicle for research and practice among healthcare researchers and practitioners because it offers so many resources in their specialized professional fields []. A tremendous amount of information and knowledge exists on the Web, waiting to be discovered, shared and utilized, and research on improving quality of life through the Web has become attractive. Healthcare researchers and practitioners need a great deal of information to carry out their work, whether that means prescribing drugs that effectively cure patients' illnesses or delivering correct and efficient medical and clinical procedures and services. Information technology has played an important role in this field for many years. By using the Semantic Web and mining technologies, researchers and practitioners in different countries can not only share their information by exchanging XML-based ontologies, but also collaborate effectively on healthcare research projects and work closely together as a team. By focusing on semantic-based information, they gain better access to the knowledge required to prescribe drugs and medical procedures that prevent or treat dangerous and infectious diseases. Researchers and practitioners have access to databases of the latest diseases, their symptoms, treatments, diagnostic analyses and other important information. This kind of information can be structured in a more understandable and machine-interpretable way by using Semantic Web languages. If this is done successfully, the resulting ontology or RDF can be fed into an inference engine, which can make new discoveries that are useful for patient treatment procedures or for general healthcare activities.
Ontologies play a key role in describing the semantics of data in both traditional knowledge engineering and the emerging Semantic Web. Since an ontology defines the exact nature of every resource in its domain and the relationships among these resources, it becomes much simpler to extract users' needs and usage tendencies.
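What it means to feed such data into an inference engine can be shown with a toy forward-chaining loop. This is a sketch under stated assumptions: the two rules mirror the usual subclass and type semantics, and the predicate names and medical facts are invented for illustration.

```python
def infer(triples):
    """Forward-chain two rules until no new triple is derived.

    Rule 1: (a subClassOf b), (b subClassOf c) => (a subClassOf c)
    Rule 2: (x type a), (a subClassOf b)       => (x type b)
    """
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 == "subClassOf" and o == s2 and p in ("subClassOf", "type"):
                    new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical medical facts, invented for illustration.
facts = infer([
    ("Influenza", "subClassOf", "ViralInfection"),
    ("ViralInfection", "subClassOf", "InfectiousDisease"),
    ("case42", "type", "Influenza"),
])
```

The statement that `case42` is an `InfectiousDisease` was never entered by anyone; it follows from the ontology, so a query for all infectious disease cases would now find it.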

3.2 Using semantic dependencies to Mine Depressive Symptoms from Consultation Records

Many psychiatric Web sites have developed psychiatric screening services for mental health care and crisis prevention. People can use these services to consult professionals about depressive symptoms, get a preliminary assessment of the severity of their symptoms, and receive health education via email or other communication media. With current systems, analyzing consultation records and making suggestions takes up a great deal of professionals' time. The Semantic Web can help to solve this problem. A new system should offer a service that first understands what kind of depressive symptoms people are experiencing and the semantic relations between those symptoms; it could then offer further diagnostic and educational services. In [4] a framework is proposed for mining depressive symptoms and their relations from consultation records.

In this framework depressive symptoms are embedded in a single sentence or in a discourse segment, that is, successive sentences describing the same depressive symptom. The Hamilton Depression Rating Scale (HDRS) is used as the domain knowledge. Data mining methods are used to identify the symptoms. The mining task is decomposed into two subtasks:

  • Identify discourse segments by grouping the successive sentences with the same semantic label.
  • Discover semantic relations that hold between discourse segments.
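The first subtask amounts to merging runs of consecutive sentences that carry the same label. The sketch below assumes the labels have already been inferred; the sentences and label names are invented.

```python
from itertools import groupby

# Hypothetical sentence labels; in the framework these would come from the miner.
labelled = [
    ("I cannot fall asleep at night.", "insomnia"),
    ("I wake up at 3 am every day.", "insomnia"),
    ("I feel worthless.", "guilt"),
    ("I can't sleep again this week.", "insomnia"),
]


def discourse_segments(labelled_sentences):
    """Merge runs of consecutive sentences that share a label."""
    return [
        (label, [sentence for sentence, _ in run])
        for label, run in groupby(labelled_sentences, key=lambda pair: pair[1])
    ]


segments = discourse_segments(labelled)
```

Only successive sentences are merged, so the last sentence forms its own segment even though its label has occurred before.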

In this framework semantic-dependency, lexical-cohesion and domain-ontology knowledge sources are integrated to mine depressive symptoms and their relations. To identify the discourse segments, the semantic dependencies of each sentence are modeled using a semantic dependency graph (SDG). A dependency relation is a relation between a word token and its head in a sentence. The head word of a sentence, the central element on which the other elements depend, is used to label the sentence, and the semantic dependencies in the SDG provide the significant features for inferring that semantic label. Four kinds of semantic relations are discovered among the discourse segments:

  • Cause-effect: because, therefore
  • Contrast: however, but
  • Joint: and, also
  • Temporal sequence: before, after

The experiments in [4] show that the framework identifies significant features for the task of mining depressive symptoms and their semantic relations to support interactive psychiatric services. The semantic-dependency structure captures the intra-sentential information, the lexical cohesion captures the inter-sentential information, and the domain ontology models the domain knowledge. Integrating these knowledge sources is a promising approach to the mining task.
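The role of the cue words listed for the four relations can be illustrated with a plain lookup. This is a deliberate simplification and an assumption of this example: the framework in [4] combines several knowledge sources, whereas the hypothetical `relation_between` helper only looks for the first cue word in the second of two segments.

```python
# Cue words taken from the four relations listed in the text.
CUES = {
    "because": "cause-effect", "therefore": "cause-effect",
    "however": "contrast", "but": "contrast",
    "and": "joint", "also": "joint",
    "before": "temporal sequence", "after": "temporal sequence",
}


def relation_between(second_segment):
    """Guess the relation from the first cue word in the second segment."""
    for word in second_segment.lower().replace(",", " ").split():
        if word in CUES:
            return CUES[word]
    return None


relation = relation_between("Therefore, I stopped going to work.")
```

A lookup like this fails whenever a relation is expressed without an explicit connective, which is one reason for bringing in lexical cohesion and domain knowledge as well.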

3.3 The requirements for ontologies in medical data integration

Information technology is now widely adopted in modern medical practice, especially to support digitized equipment, administrative tasks and data management. However, computational techniques do not use much of this medical information in research or practice, because medicine is a knowledge-based discipline that relies greatly on observed similarities rather than on the application of precise rules. The Health-e-Child (HeC) project described in [5] sets out to demonstrate that integrating medical information in novel ways yields immediate benefit for clinical research and practice. It aims to develop an integrated platform for European paediatrics, providing seamless integration of traditional and emerging sources of biomedical information as part of a longer-term vision for large-scale information-based research and training, and informed policy making.

Vertical integration of data means establishing a coherent view of the child's health to which information from each vertical level contributes, from molecular through cellular to individual. In addition, data shared among spatially separated clinicians and information produced in different departments or in multiple hospitals is brought together for the purpose of creating statistically significant samples, studying population characteristics and sharing knowledge among clinicians. The emphasis of the Health-e-Child requirements process is therefore on "universality of information", and its cornerstone is the integration of information across biomedical abstractions, whereby all layers of biomedical information are 'vertically integrated' to provide a unified view of a child's biomedical and clinical condition.

An ontology is a formal specification of a shared conceptualization. This means that an ontology represents a shared, agreed and detailed model of a problem domain. One advantage of ontologies is their ability to resolve semantic heterogeneity present within the data. Ontologies define links between different types of semantic knowledge, and the fact that they are both machine-processable and human-understandable is especially useful in this regard. Many ontologies exist today, especially in the biomedical domain. However, they are often limited to one level of vertical integration, and it would not be sensible to reuse them in their entirety. To build an appropriate ontology for HeC, the relevant parts of the available ontologies are therefore extracted and integrated into a coherent whole, which captures most of the HeC domain; the attributes of HeC that are still missing are modeled separately. The integration process involves identifying similarities between ontologies in order to determine, in a (semi-)automatic manner, which concepts and properties represent similar notions across heterogeneous data samples.
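The simplest form of such similarity detection compares concept labels as strings. The sketch below is an assumption for illustration and not the HeC procedure: it uses the standard-library `difflib.SequenceMatcher`, a hypothetical threshold of 0.8, and invented labels. Practical ontology matching also takes structure and meaning into account.

```python
from difflib import SequenceMatcher


def similar_concepts(labels_a, labels_b, threshold=0.8):
    """Pair up concept labels whose normalised string similarity is high."""
    def norm(label):
        return label.lower().replace("_", " ").strip()

    pairs = []
    for a in labels_a:
        for b in labels_b:
            score = SequenceMatcher(None, norm(a), norm(b)).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical concept labels from two source ontologies.
pairs = similar_concepts(
    ["Heart_Disease", "Tumour", "Patient"],
    ["heart disease", "Tumor", "Clinician"],
)
```

The candidate pairs would then be reviewed by a domain expert, which is what makes the process semi-automatic rather than automatic.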

As mentioned above, the use of an ontology and an inference engine can aid query enhancement, which provides clinicians with more targeted information. The ontology makes it possible to take basic queries from users and translate them into more complex, context-aware searches, which minimizes the load on the system because fewer searches are necessary. Query optimization also assists in this regard: it uses the HeC ontology to create efficient data access paths by semantically altering the initial query to find a more efficient execution path within the database. Both query enhancement and query optimization are crucial to delivering intuitive data access for clinicians while ensuring the scalability and overall stability of the system.
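Query enhancement of this kind can be pictured as expanding a basic query term with the more specific concepts below it in the ontology. The hierarchy fragment and the `expand_query` helper are hypothetical and do not come from the HeC ontology.

```python
# Hypothetical fragment of a concept hierarchy: concept -> direct sub-concepts.
SUBCONCEPTS = {
    "heart disease": ["cardiomyopathy", "arrhythmia"],
    "arrhythmia": ["tachycardia"],
}


def expand_query(term, subconcepts):
    """Return the term plus all of its direct and indirect sub-concepts."""
    expanded, queue = [], [term]
    while queue:
        current = queue.pop(0)
        if current not in expanded:
            expanded.append(current)
            queue.extend(subconcepts.get(current, []))
    return expanded


terms = expand_query("heart disease", SUBCONCEPTS)
```

A single search for all the expanded terms replaces the several searches a clinician would otherwise issue by hand.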

4 Conclusion

This paper set out to find applications of the Semantic Web and data mining in different fields. The applications reviewed show that data mining methods can be very useful for ontology construction, and that the constructed ontology can in turn be used for classification in data mining. The use of ontologies in healthcare has a significant effect and can contribute to a better quality of life.


1. S. Bloehdorn, P. Cimiano, A. Hotho and S. Staab. An Ontology-based Framework for Text Mining. 2004.
2. V. Milutinovic. Data Mining versus Semantic Web.
3. Weider D. Yu and Soumya R. Jonnalagadda. Semantic Web and Mining in Healthcare.
4. Chung-Hsien Wu and Liang-Chih Yu. Using Semantic Dependencies to Mine Depressive Symptoms from Consultation Records.
5. A. Anjum, P. Bloodsworth, A. Branson, T. Hauer, R. McClatchey, K. Munir, D. Rogulin and J. Shamdasani. The Requirements for Ontologies in Medical Data Integration: A Case Study.