Over the last decade, there has been a striking increase in the attention paid to the subject of ontology. Building coherent ontologies requires considerable effort and knowledge of the domains of interest. Different groups of experts, and sometimes individuals, may contribute to the ontology development process. In addition, various methods, languages, and tools can be used according to specific requirements and under certain conditions or circumstances. A natural consequence of the ontology development process is the creation of heterogeneous ontologies that describe different domains or even the same domain, which makes interoperation among heterogeneous systems quite difficult.
In this chapter, we first present the fundamental concepts about ontologies (Section 2.1). We give a brief overview of ontology components and languages in Sections 2.2 and 2.3 respectively. In Section 2.4 we explain the semantic heterogeneity problem and discuss it at both the ontology language and component levels. In Section 2.5 we summarize and clarify the various definitions of the concepts and terms that are used in the field. Then, we provide a detailed description of the ontology mapping, matching, and merging techniques, which are used to i) find the semantic correspondences between the entities of heterogeneous ontologies and ii) combine the source ontologies into a single coherent ontology (Section 2.6). Finally, we describe the key role that ontologies play in semantic search and retrieval systems/approaches (Section 2.7) and conclude the chapter in Section 2.8.
What is Ontology?
There are many definitions of the word "ontology" in the literature. However, Gruber's definition is the most widely referred to and quoted by the ontology community. He defined ontology as: 'an explicit specification of a conceptualization'. Later, in 1998, Studer and his colleagues explained this definition as: 'An ontology is a formal, explicit specification of a shared conceptualization. Conceptualization refers to an abstract model of some phenomenon in the world by having identified the relevant concepts of that phenomenon. Explicit means that the type of concepts used, and the constraints on their use are explicitly defined. Formal refers to the fact that the ontology should be machine-readable. Shared reflects the notion that an ontology captures consensual knowledge, that is, it is not private of some individual, but accepted by a group'. From a technical perspective, Sowa defined ontology as: 'a catalog of the types of things that are assumed to exist in a domain of interest D from the perspective of a person who uses a language L for the purpose of talking about D. The types in the ontology represent the predicates, word senses, or concept and relation types of the language L when used to discuss topics in the domain D'.
Generally, an ontology is organized in a hierarchical form. The concepts of the ontology are related by different types of relations such as specific/generic, part-of, etc. Principally, there are two types of ontologies:
Domain ontologies: these ontologies provide formal descriptions of the concepts and the relationships among those concepts that describe an application area or a particular domain such as tourism, sport, and medicine. Examples of these ontologies are MeSH, GeoNames, and Gene.
General-purpose ontologies: these ontologies contain general knowledge about several domains. They provide descriptions about domain-independent entities and facts about those entities. Examples of these ontologies are YAGO2 and Cyc.
Both types of ontologies have been extensively used to support various intelligent applications such as semantic search, Information Extraction (IE), Query Expansion (QE), Information Retrieval (IR), etc. In the context of this thesis, we show how ontological background knowledge represented by both types of ontologies can be reused to support the semantic search and retrieval capabilities of meta-search engines on the Web. In other words, we reuse ontological background knowledge to i) reformulate users' queries that are submitted to the meta-search engine, and ii) semantically match the reformulated queries to the results returned by the meta-search engine. As we mentioned earlier, it is important to overcome the two major problems associated with existing ontologies (semantic heterogeneity and semantic knowledge incompleteness) before actually reusing the ontological background knowledge that they represent. To do so, we use the proposed ontology merging and enrichment framework to combine heterogeneous domain-specific ontologies into single coherent ontologies on the one hand, and to automatically enrich the merged ontologies, as well as other general-purpose ontologies, on the other.
Domain-specific and general-purpose ontologies can be modeled using different knowledge modeling techniques. In addition, they can be formally implemented in various ontology languages. Despite the differences in the used modeling techniques and languages, there exists a common set of knowledge modeling components that are shared among both types of ontologies. These components are: classes a.k.a. concepts, relations, and instances a.k.a. individuals.
Figure 2-1: Part of Transportation Ontology (instances shown: OPEL Corsa GSi, Ford Focus ST)
Classes of the ontology represent the concepts of the domain that the ontology describes. They can represent abstract concepts (e.g., feelings, beliefs, etc.) or specific concepts (e.g., organizations, people, countries, etc.). For example, in Figure 2-1, which shows part of an ontology that describes the transportation domain, the concepts (Automobile, Motor Vehicle, Vehicle, etc.) are examples of the ontology classes. These classes are shown as rectangles.
Relations are used to represent how the concepts of the ontology are related to each other. For example, in Figure 2-1, the concept Vehicle is related to the concept Motor Vehicle through an "is-a" relation, and the concept Air Bag is related to the concept Vehicle through a "has-member" relation.
Instances are used to represent individuals of the ontology concepts. For example, Ford Focus ST and OPEL Corsa GSi are instances or individuals that belong to the concept Automobile in the transportation ontology.
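The three components described above can be pictured as a small data structure. The following is only a minimal sketch: the `Ontology` class and its method names are hypothetical, and the direction of the "is-a" edges (from the more specific concept to the more general one) is an assumption, since Figure 2-1 itself is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """A toy ontology: classes, typed relations between classes, and instances."""
    classes: set[str] = field(default_factory=set)
    # Each relation is a (subject, relation name, object) triple.
    relations: set[tuple[str, str, str]] = field(default_factory=set)
    # Maps an instance name to the class it belongs to.
    instances: dict[str, str] = field(default_factory=dict)

    def add_relation(self, subject: str, name: str, obj: str) -> None:
        self.classes.update({subject, obj})
        self.relations.add((subject, name, obj))

    def add_instance(self, instance: str, cls: str) -> None:
        self.classes.add(cls)
        self.instances[instance] = cls

    def ancestors(self, cls: str) -> set[str]:
        """All classes reachable from cls by following "is-a" edges upward."""
        found: set[str] = set()
        frontier = [cls]
        while frontier:
            current = frontier.pop()
            for subject, name, obj in self.relations:
                if subject == current and name == "is-a" and obj not in found:
                    found.add(obj)
                    frontier.append(obj)
        return found


# Assumed edge direction: the specific concept "is-a" the general concept.
transport = Ontology()
transport.add_relation("Motor Vehicle", "is-a", "Vehicle")
transport.add_relation("Automobile", "is-a", "Motor Vehicle")
transport.add_instance("Ford Focus ST", "Automobile")
transport.add_instance("OPEL Corsa GSi", "Automobile")
```

With this representation, `transport.ancestors("Automobile")` collects both Motor Vehicle and Vehicle, which is the kind of hierarchical information that the matching techniques discussed later in the chapter rely on.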
An ontology language is used to express and formally implement ontologies. A large variety of languages are used for this purpose. Despite this variety, ontology languages all express, with different levels of expressiveness, the common set of knowledge modeling components mentioned in the previous section. The syntax of ontology languages is based on existing mark-up languages such as XML. Some examples of ontology languages are OWL (Web Ontology Language), which is recommended by the W3C, and RDF (Resource Description Framework). Figure 2-2 shows an example of the syntax of the OWL language. In this example, only the "is-a" relation is used between the concepts of the ontology.
<rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A bibliographic ontology for the form and content of bibliographic descriptions from the viewpoint of libraries. See the publication of a set of cataloguing principles by the International Federation of Library Association and Institutions (IFLA).
Figure 2-2: Part of an Ontology in OWL Syntax
The OWL language has three sublanguages: OWL Lite, OWL DL, and OWL Full. These sublanguages are described below:
OWL Lite: is the light version of OWL. It provides the basic constructs to build an ontology. The restrictions imposed on OWL Lite make it easy for ontology engineers and users to construct simple hierarchical structures in OWL with lower computational complexity than other OWL sublanguages.
OWL DL: it provides more expressive power than OWL Lite. It supports users who want to maintain computational completeness (all conclusions are guaranteed to be computable) and decidability (all computations will finish in a finite time) while using the maximum expressiveness of the language.
OWL Full: the computational complexity of OWL Full is higher than that of the other two sublanguages. Unlike OWL DL, this sublanguage does not guarantee computational completeness or decidability. Therefore, applications that do not require guaranteed conclusions and need more language expressivity can use OWL Full.
Ontological Semantic Heterogeneity
A growing number of ontologies are being developed by different groups of experts and individuals. Since this process is distributed in nature, it has resulted in the creation of several similar and overlapping ontologies that are used to describe the same domain. This problem is known as the semantic heterogeneity problem between domain-dependent ontologies. Ontologies can be heterogeneous at two different levels of analysis. The first case of heterogeneity is caused by the languages that are used to implement ontologies. The main reason is that those languages use different syntaxes to represent the various components of the ontology. Therefore, we may find two or more ontologies that represent the same domain in totally different languages.
The second case of heterogeneity is associated with the ontology component level. The main factors that result in the semantic heterogeneity at this level are:
1. Conceptual heterogeneity: occurs between two or more ontologies that are used to describe the same subject area where one ontology may contain a definition of a particular concept and the other ontology(ies) may ignore that concept.
2. Terminological heterogeneity: happens when using different names to describe the same concepts. For example, to describe the concept (motor vehicle with four wheels), one ontology may use the name "car" while others may consider the name "automobile" to refer to the same concept.
3. Encoding heterogeneity: refers to using different formats to encode values in the ontology. For example, the date format may have different encodings in different ontologies.
The abovementioned semantic heterogeneity factors and the large number of existing ontologies form a major challenge to ontology-based systems. Regardless of whether the system is using domain-specific, general-purpose, or both types of ontologies, the semantic heterogeneity problem should be tackled by combining the heterogeneous ontologies or by bringing them into mutual agreement. As a consequence, the combined or merged ontology will reflect a consensus view of the modeled domain of interest and hence can be easily shared and reused by various ontology-based systems/approaches.
Before we discuss the various ontology combination techniques, it is important to understand the terminology used in this field. In this section, we summarize and clarify the various definitions of the concepts and terms that are used when combining ontologies. These concepts and terms are defined as follows:
Ontology Matching: is the process of finding relationships or similar parts between entities of different ontologies using several matching techniques. The output of this process is known as Alignment.
Ontology Alignment: is the set of correspondences between two or more source ontologies. In this context, the source ontologies are brought into a mutual agreement and made consistent and coherent, but kept separately.
Ontology Mapping: is the process of mapping each entity of one ontology to at most one entity of another ontology through an equivalence relation.
Ontology Merging: refers to the combination of two or more source ontologies that are used to describe the same subject area into a single coherent ontology.
Ontology Integration: refers to the combination of two or more ontologies from different domains into a new ontology.
To address the semantic heterogeneity problem between heterogeneous ontologies we will consider the abovementioned terms. Our goal is to employ knowledge represented by merged domain-specific ontologies and other general-purpose ontologies to semantically enhance the retrieval effectiveness of traditional Web meta-search techniques.
The ontology merging process comprises several sub-processes. First, we find 'matches' between the entities of the heterogeneous domain-specific ontologies. Then, based on the identified matching entities, the ontologies are merged into a single coherent ontology. We further update the produced merged ontologies, as well as other general-purpose ontologies, by enriching them with additional semantically related entities. These entities can be automatically extracted from the vast amount of information encoded in texts on the Web. Our final goal is to reuse the ontological background knowledge represented by the updated ontologies to semantically support Web meta-search engines, and hence improve their retrieval effectiveness.
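The match-then-merge order described above can be illustrated with a deliberately simple sketch. It is not the merging framework proposed in this thesis. It assumes that each ontology is a set of (subject, relation, object) triples and that the alignment is already available as a dictionary; the function name `merge_ontologies` and the sample data are hypothetical.

```python
def merge_ontologies(edges_a, edges_b, alignment):
    """Union of both edge sets after renaming B's entities to their matched A entities.

    alignment maps an entity name in ontology B to its equivalent in ontology A.
    """
    def rename(name):
        return alignment.get(name, name)

    merged = set(edges_a)
    merged.update((rename(s), rel, rename(o)) for s, rel, o in edges_b)
    return merged


# Hypothetical source ontologies describing the same domain with different names.
edges_a = {("Car", "is-a", "Motor Vehicle")}
edges_b = {
    ("Automobile", "is-a", "Motor Vehicle"),
    ("Automobile", "has-part", "Engine"),
}
# Output of the matching step: "Automobile" in B corresponds to "Car" in A.
alignment = {"Automobile": "Car"}

merged = merge_ontologies(edges_a, edges_b, alignment)
```

Here the duplicate "is-a" statement collapses into one triple, while the "has-part" statement that only ontology B contained is carried over under the shared name.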
In the next section, we describe the various techniques that are used to integrate and combine heterogeneous domain-specific ontologies.
Ontology Combination and Integration Techniques
Several ontology combination and integration techniques have been proposed to solve the semantic heterogeneity problem. These techniques aim at i) finding semantic correspondences between the entities of heterogeneous ontologies (these techniques are also referred to as ontology mapping and matching techniques), and ii) merging or integrating them into a new coherent ontology (these techniques are also referred to as ontology merging and combination techniques). Considering the ontology mapping and matching techniques, Choi and his colleagues classify the usage of these techniques into three different categories as follows:
The first category is concentrated on finding mappings between an integrated global ontology and local ontologies. This category supports ontology integration by describing the relationship between an integrated global ontology and local ontologies. In this case, ontology mapping is used to map a concept found in one ontology into a view, or a query over other ontologies.
The major strength of this mapping category is that it is easier to define mappings and find mapping rules than in mapping local ontologies. This is because an integrated global ontology provides a shared vocabulary and all local ontologies are related to a global ontology. However, this mapping requires an integrated global ontology, which is a major drawback; in the sense that there exists no one single integrated ontology that covers concepts in all different domains.
The second mapping category focuses on mapping between local ontologies. This type of mapping enables interoperability for highly dynamic and distributed environments and can be used for mediation between distributed data in such environments. In this context, the source and target ontologies are both semantically related at a conceptual level where source ontology entities are transformed into target ontology entities based on the identified semantic relations between them. This avoids the complexity and overheads of integrating multiple sources. However, because of the lack of common vocabularies among local ontologies, finding mappings between them may not be easier than between an integrated global ontology and local ontologies.
The third mapping category is associated with ontology merging and combination. It is used as a part or prerequisite of the ontology merging or combination process. In this context, ontology mapping establishes correspondences between the entities of local ontologies to be further considered in the merging or combination process. Additionally, it determines the sets of overlapping, synonymous, or unique concepts in the source ontologies.
In the context of our work, we focus on the third category where the produced mappings or 'matches' between the entities of the source ontologies will be further exploited as a part of the ontology merging process. Several techniques can be used to obtain the set of mapping entities a.k.a. 'matches'. We distinguish in the literature two types of ontology matching techniques. These are syntactic-based and semantic-based techniques. In the following section we provide a detailed description of these techniques.
At the syntactic level, matching is computed between the entities of the source ontologies without taking into account the semantic relationships between them. Syntactic-based techniques can be further differentiated into string-based, language-based, and instance-based techniques.
String-based techniques are used to measure similarities between names and name descriptions of the various ontological components. They build on the assumption that the more similar the strings of two entities are, the more likely they are to denote the same entity. The result of measuring a string distance is a non-negative real number, where a smaller value indicates a greater similarity between the compared strings. Some examples of string-based techniques that are used extensively in ontology matching systems are edit-distance and substring or n-gram techniques. In the edit-distance technique, the minimum cost of the edit operations required to convert a string (s) into a string (t) is calculated. In the n-gram or substring technique, the number of common n-grams (substrings of length n) between strings (s) and (t) is computed. For example, for n=3, the trigrams of the string "article" are "art", "rti", "tic", "icl", "cle". The string "particle" has six trigrams, five of which it shares with "article". Therefore, the similarity between the string "article" and the string "particle" would be 5/6.
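Both string-based measures can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the measure used by any particular matching system: the edit distance uses unit costs for insertion, deletion, and substitution, and the n-gram score divides the number of shared n-grams by the larger of the two n-gram counts, which is the normalisation that reproduces the 5/6 figure in the example above. The function names are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit cost per insertion, deletion, or substitution."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs from s
                current[j - 1] + 1,      # insert ct into s
                previous[j - 1] + cost,  # substitute cs by ct, or keep it
            ))
        previous = current
    return previous[-1]


def ngrams(s: str, n: int = 3) -> set[str]:
    """The set of all substrings of length n in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(s: str, t: str, n: int = 3) -> float:
    """Shared n-grams divided by the larger n-gram count (assumed normalisation)."""
    a, b = ngrams(s, n), ngrams(t, n)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


print(edit_distance("article", "particle"))     # one insertion is enough
print(ngram_similarity("article", "particle"))  # five shared trigrams out of six
print(ngram_similarity("car", "automobile"))    # no shared trigram at all
```

The last call shows the limitation of purely string-based measures: "car" and "automobile" share no trigram, so they receive a score of zero.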
The main problem of string-based techniques is the lack of semantic characterization of ontology entities. For example, the concepts "car" and "automobile" are considered to be not similar with regard to string-based techniques, while they are synonyms i.e. they refer to the same concept and have the same meaning.
It is important to mention that some approaches utilize Natural Language Processing (NLP) techniques in order to improve the results of string-based similarity measures. Some examples of these language-based techniques are: tokenization, where strings are compared as multi-sets of tokens; stopword removal (e.g., the removal of prepositions and articles); and stemming, i.e., reducing words to their roots (e.g., the word 'playing' is stemmed to 'play').
Instance-based techniques deal with the instances of the ontology entities. The intuition behind using such techniques is that for two source ontologies having the same or similar instances we can find corresponding concepts. Similarity between instances is measured using string-based techniques and other common measurements. If the similarity measure is greater than a certain threshold value, then the two instances are considered a match, i.e., equivalent. Accordingly, similar concepts are then identified based on the results of matching the instances of the source ontologies.
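The three preprocessing steps can be combined into one small normalisation routine. This is a rough sketch only: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real systems use larger ones.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words stay untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(label: str) -> list[str]:
    """Tokenize a label, drop stop words, and stem the remaining tokens."""
    # Insert a space at camelCase boundaries, then extract alphabetic tokens.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    tokens = re.findall(r"[A-Za-z]+", spaced.lower())
    return [naive_stem(tok) for tok in tokens if tok not in STOP_WORDS]


print(normalize("playing"))
print(sorted(normalize("The Names of Authors")) == sorted(normalize("authorName")))
```

After normalisation, the labels "The Names of Authors" and "authorName" yield the same multi-set of tokens, so a string-based measure applied afterwards would treat them as identical.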
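A minimal sketch of instance-based matching follows. It assumes `difflib.SequenceMatcher` from the Python standard library as the string measure and a hypothetical threshold of 0.85; neither choice comes from the systems surveyed here, and the sample instance lists are invented for illustration.

```python
from difflib import SequenceMatcher


def instance_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def concept_overlap(instances_a, instances_b, threshold=0.85):
    """Fraction of instances in the smaller list that have a match in the other list."""
    small, large = sorted((list(instances_a), list(instances_b)), key=len)
    if not small:
        return 0.0
    matched = sum(
        1 for x in small
        if any(instance_similarity(x, y) > threshold for y in large)
    )
    return matched / len(small)


# Hypothetical instances of two concepts taken from two source ontologies.
automobile_instances = ["Ford Focus ST", "OPEL Corsa GSi"]
car_instances = ["ford focus st", "Opel Corsa GSI", "Audi A4"]

print(concept_overlap(automobile_instances, car_instances))
print(concept_overlap(["Boeing 747"], car_instances))
```

A high overlap between the instance lists of two concepts suggests that the concepts themselves correspond, whereas a concept whose instances find no counterpart scores zero.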
In practice, ontology matching systems propose further enrichment of the syntactic-based techniques with the use of semantic-based techniques. In this context, semantic correspondences between the entities of heterogeneous ontologies are also obtained based on their semantic similarity. The following are examples of semantic-based techniques.
In this technique ontologies are viewed as labelled graphs which contain terms (entity names) and their inter-relationships (e.g. is-a, part-of). The similarity between ontologies' entities is measured based on the positions of these entities within the graphs. Mainly, it is based on the following assumptions:
First, if the direct super-concepts and/or the direct sub-concepts of two concepts are similar, the two compared concepts may also be similar.
Second, if two concepts from the two ontologies are similar, their neighbours might also be semantically related.
This technique also considers ontologies as labelled graphs, but it takes into account only the "is-a" relation between graph nodes (ontology entities) for similarity measures. The intuition is that "is-a" links nodes that are similar; therefore, their neighbours may also be similar.
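The structural intuition above can be sketched as a comparison of neighbourhoods. The code is an assumption-laden illustration: ontologies are sets of (subject, relation, object) triples, `matches` holds correspondences that are already known, and the Jaccard-style ratio is only one of several possible ways to score the overlap. Passing `relation="is-a"` restricts the comparison to the taxonomy, which mirrors the second technique.

```python
def neighbours(edges, node, relation=None):
    """Nodes directly linked to node; optionally keep only one relation type."""
    result = set()
    for subject, name, obj in edges:
        if relation is not None and name != relation:
            continue
        if subject == node:
            result.add(obj)
        elif obj == node:
            result.add(subject)
    return result


def structural_similarity(edges_a, node_a, edges_b, node_b, matches, relation=None):
    """Shared neighbours divided by all neighbours of the two compared nodes."""
    # Translate A's neighbours into B's vocabulary wherever a match is known.
    near_a = {matches.get(n, n) for n in neighbours(edges_a, node_a, relation)}
    near_b = neighbours(edges_b, node_b, relation)
    union = near_a | near_b
    return len(near_a & near_b) / len(union) if union else 0.0


# Hypothetical fragments of two transportation ontologies.
edges_a = {("Car", "is-a", "Motor Vehicle"), ("Car", "has-part", "Air Bag")}
edges_b = {
    ("Automobile", "is-a", "Motorized Vehicle"),
    ("Automobile", "has-part", "Airbag"),
    ("Automobile", "has-part", "Engine"),
}
matches = {"Motor Vehicle": "Motorized Vehicle", "Air Bag": "Airbag"}

graph_score = structural_similarity(edges_a, "Car", edges_b, "Automobile", matches)
taxonomy_score = structural_similarity(
    edges_a, "Car", edges_b, "Automobile", matches, relation="is-a"
)
```

In this example two of the three distinct neighbours are shared when all relations are considered, while the "is-a" neighbourhoods coincide completely.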
Use of External resources and Upper Level Formal Ontologies
In addition to the abovementioned syntactic-based and semantic-based techniques, a domain-specific ontology, a general thesaurus, or an upper level formal ontology can be used in order to find matches between the entities of heterogeneous ontologies. For example, WordNet is used as a ground on which comparisons can be based for initiating the mappings between ontologies' entities. For instance, if we have the concept "car" in ontology_1 and the concept "automobile" in ontology_2, using WordNet we find that these two concepts are synonyms.
Domain-specific ontologies are also used in this context to describe details of the domains of interest. Generally, such details are missing or not covered by general-purpose ontologies. For example, in the computer domain concepts "U.S.B" and "Universal Serial Bus" are synonyms. This can be inferred using a computer domain-specific ontology.
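The lookup against external resources can be sketched as follows. The two synonym tables are small hypothetical stand-ins for WordNet and for a computer-domain ontology; the code does not call the real WordNet API.

```python
# Hypothetical synonym sets standing in for an external resource such as
# WordNet (general) or a computer-domain ontology (domain-specific).
GENERAL_SYNSETS = [{"car", "automobile", "auto"}]
COMPUTER_SYNSETS = [{"u.s.b", "universal serial bus"}]


def are_synonyms(term_a, term_b, resources):
    """True if some resource lists both terms in the same synonym set."""
    a, b = term_a.lower(), term_b.lower()
    if a == b:
        return True
    return any(
        a in synset and b in synset
        for synsets in resources
        for synset in synsets
    )


print(are_synonyms("car", "automobile", [GENERAL_SYNSETS]))
print(are_synonyms("U.S.B", "Universal Serial Bus", [GENERAL_SYNSETS]))
print(are_synonyms("U.S.B", "Universal Serial Bus", [GENERAL_SYNSETS, COMPUTER_SYNSETS]))
```

The second call fails and the third succeeds, which illustrates why a domain-specific resource is needed alongside a general one.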
On the other hand, upper level formal ontologies are usable resources that aim at providing an agreed-upon set of concepts and formal specification of other ontologies. In this context, application developers can use and extend these general upper level ontologies with concepts and properties specific to their applications. Examples of upper level formal ontologies are the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) and the Suggested Upper Merged Ontology (SUMO).
Figure 2-3 shows a classification of the discussed ontology matching techniques.
Figure 2-3: Classification of the Matching Techniques
Reuse of Ontological Background Knowledge to Semantically Extend Keyword-based Search Models
Despite the significant improvement in the search effectiveness of current Web search engines, they often fail to meet the user's information needs. This is because these search engines use query terms as syntactic descriptors to rank the content of Web pages. In this context, the search task is built on keyword-based matching (enriched with some IR models such as tf/idf and Google PageRank) between the query terms and the engine's database. A major problem of classical keyword-based matching models is that they do not provide means for identifying semantically related concepts that should be covered by the list of returned results. For instance, a user looking for programming languages might be interested not only in documents that talk about programming languages in general but also in those that talk about particular programming languages such as Java and C#. Other problems of existing keyword-based search models are the polysemy and synonymy problems. Polysemy refers to the fact that a query term may have multiple meanings and, therefore, query results may contain documents where the query term is used in a meaning that is different from what the user had in mind when defining the query. Synonymy, on the other hand, reflects the case where two different terms can express the same meaning in a given context. For example, the query terms car and automobile refer to the same meaning and, therefore, if the user uses the term car in the query, documents that contain only its synonym, i.e., automobile, will not be retrieved.
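The synonymy and related-concept problems can be made concrete with a toy example. Everything in it is hypothetical: the three documents, the two small background-knowledge tables, and the use of plain substring containment in place of a real retrieval model.

```python
# Hypothetical background knowledge: synonyms and narrower ("kind-of") terms.
SYNONYMS = {"car": {"automobile"}}
NARROWER = {"programming languages": {"java", "c#"}}

# Hypothetical document collection.
DOCUMENTS = {
    "d1": "used automobile prices this year",
    "d2": "car rental offers",
    "d3": "an introduction to java generics",
}


def keyword_search(query, documents):
    """Ids of documents whose text contains the query string (substring match)."""
    return {doc_id for doc_id, text in documents.items() if query in text}


def expanded_search(query, documents):
    """Keyword search after adding the query's synonyms and narrower terms."""
    terms = {query} | SYNONYMS.get(query, set()) | NARROWER.get(query, set())
    return {
        doc_id for doc_id, text in documents.items()
        if any(term in text for term in terms)
    }


print(keyword_search("car", DOCUMENTS), expanded_search("car", DOCUMENTS))
print(keyword_search("programming languages", DOCUMENTS),
      expanded_search("programming languages", DOCUMENTS))
```

Plain keyword matching misses the document that mentions only "automobile" and finds nothing for "programming languages", whereas the expanded queries retrieve the semantically related documents in both cases.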
In order to address the problems of classical keyword-based search models, several approaches/systems have proposed extending those models with semantics. In this context, the search for information is based on the meaning of words or phrases rather than merely the presence of keywords within the document collection. Accordingly, external sources of semantic information represented by ontological background knowledge have been exploited to improve the effectiveness of keyword-based search models. The work presented by Voorhees was among the first attempts to use ontological lexical-semantic relations (represented by the WordNet general-purpose ontology) to disambiguate query terms. To disambiguate an ambiguous term, the synsets (i.e. senses) of that term were ranked based on the amount of co-occurrence between the term's context and terms in the hood of its synsets. She defined the hood of a term sense contained in a synset s as follows. "To define the hood of a given synset, s, consider the set of synsets and the hyponymy & hypernymy relations in WordNet as the set of vertices and directed edges of a graph. Then the hood of a given synset s is the largest connected subgraph that contains s, contains only descendants of an ancestor of s; and contains no synset that has a descendent that includes another instance of a member of s as a member". Her conclusion, supported by experimental results, showed a significant improvement in performance for a set of manually selected expansion terms.
Another example of the systems that used ontologies for semantic search purposes is the system proposed by Gauch and colleagues. The proposed system used subject hierarchies provided by online portals such as Yahoo.com and About.com as a reference ontology to support personalized Web semantic search.
In 2007, Tran and her colleagues proposed using the SWRC ontology to interpret the query's keywords in order to support semantic search. Their proposed system, XXploreKnow!, used the SWRC ontology as its underlying repository of knowledge to highlight and match the keywords of the user query to a set of entities that are defined in the used ontology. Matching entities are then displayed to the user to help him/her reformulate the submitted queries in order to obtain a better list of query results.
A major problem of the abovementioned systems is that some of the query's keywords cannot be mapped to their equivalent ontological entities (concepts, instances, relations), because of the limited domain coverage of the exploited ontologies. To overcome this problem, recent systems propose using multiple ontologies to provide more comprehensive coverage of various domains. For instance, in 2009, Wimalasuriya & Dou used multiple ontologies in specialized domains for Information Extraction (IE) purposes. Their experimental results showed that using multiple ontologies significantly improved the precision of the system. Although the aim of this work was to tackle a different problem, namely IE, it inspired us to use multiple ontologies to support Web semantic search. In this context, we used multiple ontologies to support the semantic search capabilities of the MultiSearch meta-search engine. In particular, we reused ontological background knowledge represented by multiple general-purpose ontologies, such as YAGO and the open source version of the Cyc ontology, to derive the semantic aspects of the user's query on the one hand, and to semantically rank the search results returned by individual search engines on the other. Experimentally, we evaluated the quality of the results produced by the MultiSearch meta-search engine when using a single ontology vs. multiple ontologies. The results showed that using multiple general-purpose ontologies improved the precision of the system. However, it is important to mention that we faced the problem of semantic knowledge incompleteness in the used ontologies. In other words, we found that, for some queries, we were not able to map their terms to any of the entities represented by the used ontologies. To address this problem, we utilized statistical-based semantic relatedness measures.
The intuition behind using those measures was twofold: 1) to discover the strength of the semantic relatedness between the entities missing from the used ontologies and those that are defined in them, and 2) to enrich the reformulated query with additional semantically related entities (i.e., those which were not defined in any of the used ontologies) that we believe may lead to a better understanding of the query's intent. Although the used measures provided us with the ability to enrich the query's semantics with additional semantic knowledge, we believe that exploiting additional domain-specific ontologies (produced by the proposed ontology merging and enrichment framework that we discuss in Chapter 4) can significantly improve the precision of the MultiSearch meta-search engine. Therefore, in the extended version of MultiSearch, called Multi-Search+, we exploited both types of ontologies (general-purpose and domain-specific) to provide broader and deeper coverage of various domains. Further details on Multi-Search+ are provided in Chapter 5.
In this chapter, we have presented the fundamental concepts required for the understanding of ontology. We have elaborated the different kinds of semantic mismatches that could happen between domain-specific ontologies and identified the solutions to reconcile those mismatches.
We have provided a detailed discussion on ontology matching and merging techniques and classified them into syntactic-based and semantic-based techniques. We further clarified the differences between those techniques and elaborated the intuitions behind using them.
Another key issue that we have focused on is the important role that ontologies play in semantically extending the search capabilities of traditional keyword-based search models. We have explained that ontological background knowledge can be reused to i) help users formulate their queries and ii) reduce the "semantic gap" between the meanings of the keywords that are used in their queries and those that are used to index the document collections in the search engines' databases.