A Geographic Information System (GIS) deals with cartographic information and its related metadata. Organizations that work with geographic information need a GIS. During the 1970s and 1980s such organizations developed their own systems, built mainly around the information their own work required. They collected their own data and rarely shared it with others.
Over the last 10 to 15 years the internet has changed human life dramatically. The world has become closer, and people can easily share information with each other. The internet has also reduced the inefficient and redundant data storage of earlier days. As a result, GIS have become interoperable.
Moreover, open source has become more popular in the last few years. In an open source system, different developers can integrate and extend the software and so bring new software products to the market.
A large number of open source tools are now available for different aspects of geographical information, such as spatial databases, web map servers and analysis tools. GeoNetwork is one of them, and it is the application domain of my thesis.
The OGC (Open Geospatial Consortium) provides the CSW (Catalogue Service for the Web) standard, which describes a set of interfaces, bindings and encodings that can be implemented in catalogue servers. Data providers use these servers to publish collections of descriptive information. The goal of INSPIRE (INfrastructure for SPatial InfoRmation in Europe) is to improve the accessibility and interpretability of spatial data and information in Europe. I focus on one particular component of the INSPIRE architecture, the CSW based geo-catalog service.
While sharing information makes human life easier, it makes the developer's life more complex. When information was produced and maintained locally, it was easy to handle properly. When information is produced by one party and has to be shared with others, it is much harder to maintain. Developers need to handle heterogeneous information, because different organizations present information in different ways. This is why the term semantic has been raised.
The dictionary meaning of the word semantic is "relating to meaning or the study of meaning". Semantic analysis is a method for eliciting and representing knowledge about organizations.
In the GIS world, different organizations present their information in their own way, so a heterogeneity problem arises. Because interoperable GIS share information among different organizations, they need to avoid heterogeneity in the presentation of data. GIS researchers identify three types of heterogeneity: syntactic, structural and semantic.
Different models can be used to represent cartographic information; for example, angles can be represented in different coordinate systems. The OGC (Open Geospatial Consortium) offers GML (Geography Markup Language), a specification oriented language that provides a common format for representing geographic information. In this way syntactic heterogeneity can be resolved.
An organization can choose its own mental model for representing information; for example, different themes can be used to present the same attribute. Metadata describes the structure of the representation schema and can be used to avoid structural heterogeneity.
Different organizations can use different mental models and categorize the representation of information in their own way. These categories are mainly related to the thematic concepts of the geographic information. This is semantic heterogeneity.
1.1.1 Semantic heterogeneity in the geo-information domain
Semantic heterogeneity has been studied for the past two decades, mainly in the domain of information sharing. It is a main roadblock to sharing information in GIS.
Semantic heterogeneity in GIS can be caused by differences in formalization, conceptualization, assumed context and so on.
Many GIS tools have their own data languages; some geo data languages are quite complex while others are relatively simple. Different types of languages are also defined for exchanging information. Information can be presented in different ways within one language or across different languages. These two factors, language difference and presentation difference, introduce formalization heterogeneity.
Sometimes there are fundamental differences in how real world information is modeled, so the data are difficult for a particular application to access. This is called conceptualization heterogeneity.
Many words, phrases and statements help to establish the meaning of information in GIS, and they may differ from one organization to another. This situation is known as context heterogeneity. When different organizations try to integrate or share their information, context heterogeneity appears and needs to be resolved.
In short, semantic heterogeneity arises wherever there is disagreement about the meaning, interpretation or presentation of information. GIS practitioners therefore need to know what the meaning is, how meaning is conveyed, and how the process of communication works. Mechanisms for overcoming semantic heterogeneity in order to share and integrate information from different sources are known as semantic interoperability or semantic integration. Semantic integration has become an important research area in GIS.
1.1.2 Semantic search by matching
Ontological descriptions of information sources have been introduced to overcome the heterogeneity problem in geographical data. An ontology is a formal specification of the concepts of a domain. It improves the quality of the descriptions of information sources and makes the meaning of their content interpretable by machines. Users can express their queries, and thematically integrated information is derived from the implicit relationships between search terms. Classification of information sources also becomes more flexible through an ontology.
CSW provides a service that answers user queries. Finding, translation and integration are three types of semantic query.
Finding enables users to find datasets, and their values, that contain information on a particular theme.
Translation enables users to find the vocabulary of thematic information that the system understands, if data cannot be found directly.
Integration builds a new dataset from a combination of different datasets in which the thematic information exists.
Although three types of query operation have been described, current catalogs provide only a simple keyword based service for finding datasets.
The aim of this thesis is to improve the search facility of GeoNetwork. I first discuss GeoNetwork and its structure. I then discuss its current search process and its limitations. Next I focus on semantic search, which can improve search quality, and describe the basic semantic search method. Finally, I describe how semantic search can be integrated into GeoNetwork.
1.3 Structure of the thesis
In the next chapter I briefly discuss the problem scenario of the application domain. I first describe the current scenario of the application domain and where it is used. I then discuss the search method currently in use and its drawbacks, and outline an idea that gives users more accurate search results.
The application itself is described in chapter three. I first discuss the technology used in GeoNetwork, then the architecture and features of the existing system. Finally the structure of the database system is discussed briefly with sample data.
In chapter four I discuss semantic search: why it is useful and which applications can use it. I then describe in detail the steps of the basic semantic search method.
In chapter five I discuss how semantic search can be integrated with GeoNetwork to make the product richer for its users. The plug-in architecture is described there.
2. Application Domain
Geographic Information Systems (GIS), remote sensing, agro-meteorological and other environmental observations serve to collect and process data for geo-referenced issues such as environment and natural resource management, food production and food security, coastal zone monitoring, desertification, biological diversity, energy and climate change impact. Data from other sources, such as socio-economic, demographic, administrative or political boundaries, infrastructure and transport networks, can also be merged with geo-referenced data for in-depth analysis.
Techniques and tools such as GeoNetwork, Climpaq, the Global Terrestrial Observing System (GTOS) and Land Cover are used for the acquisition, analysis and distribution of geo-referenced data.
GeoNetwork is an open source cataloging tool for geo-referenced resources. It is a standardized and decentralized spatial information management environment.
The FAO (Food and Agriculture Organization of the United Nations) joined with the research and mapping expertise of the WFP (World Food Programme), UNEP (United Nations Environment Programme) and UN-OCHA (United Nations Office for the Coordination of Humanitarian Affairs) to develop GeoNetwork, so that they could share their spatial databases, including digital maps, satellite images and related statistics, effectively. Its purpose is to improve access to and integrated use of spatial data and information by FAO's member countries, and to serve as a knowledge repository of spatial information.
GeoNetwork is designed to enable access to, and exchange and sharing of, geo-referenced databases, cartographic products and related metadata collected from different sources, between organizations and their audiences over the internet. The aim of this approach is to serve the community of spatial information users, so that from a single entry point they can access available spatial data and existing thematic maps stored in different databases around the world.
GeoNetwork provides tools for managing and publishing metadata on spatial data and related services. It allows distributed search over a huge volume of metadata from different sources. It also provides an interactive map viewer that draws on distributed servers over the internet.
The GeoNetwork search engine offers simple search and advanced search, based on a set of criteria, for geographical data and information. The result of a search is a list of metadata records. A record can be viewed or downloaded. If an interactive map is available for a specific record, it is also downloadable.
Decision makers, GIS experts and spatial analysts are the typical users of GeoNetwork. Digital maps and satellite images are effective communication tools and play an important role for these users.
Development planners and emergency managers in organizations that work with cartographic products may need quick, reliable, up to date and user-friendly maps and images to plan better. GeoNetwork helps them make sound decisions for their organizations by providing maps, satellite images and similar products.
GIS experts analyze geographical data, so they need geographical information, which GeoNetwork can provide.
Analysts in fields such as socio-economics, demography, administrative or political boundaries, infrastructure and transport networks need geographic data to improve their analysis. For example, a meteorological analyst can perform a preliminary geographical analysis and give reliable forecasts, so that people can take appropriate action in vulnerable areas.
Exchanging and sharing data between organizations increases co-operation and co-ordination of effort in collecting data, and everybody benefits when the data are made available to everybody. It also avoids duplication of data.
The main goal of GeoNetwork is to improve the accessibility of the wide variety of data collected from different sources, and to organize and document the data in a standard and uniform way.
2.2 Problem Statement
Information sharing through GIS is increasing day by day, so more attention is being given to search methods and to improving search quality. The aim of this research is to return data that more accurately match what the user is looking for.
Apache Lucene is an open source text search engine library written in Java. Applications such as e-mail clients, mailing lists, web searches and database searches use Lucene to build their search modules. GeoNetwork uses Lucene for indexing and searching metadata.
A Lucene search as used here returns only text that exactly matches the metadata. If the value does not match exactly, no result is shown, even though the metadata may contain synonyms of the search word or related information. Lucene search does not meet this requirement. Moreover, GeoNetwork supports only simple AND, OR, equal and not-equal clauses; it does not support spatial queries that carry logical meaning.
The search method therefore needs to be enhanced, and semantic search is proposed. Semantic search checks not only the search keywords but also the context of each word. It uses the meaning of a keyword in the language to return more relevant results.
Semantic search automatically identifies the concepts behind a search word, so besides synonyms it also finds related words. If someone searches for election, semantic search will retrieve metadata containing election (the search word), vote (a synonym), and campaigning, ballots and vote centre (related terms).
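The difference between exact keyword matching and concept-based expansion can be sketched in a few lines of Python. This is only an illustration: the vocabulary tables, the record contents and the function names are hypothetical, the matching is plain substring search rather than Lucene, and a real system would take its synonyms and related terms from a resource such as WordNet.

```python
# Hypothetical toy vocabulary; not taken from GeoNetwork or WordNet.
SYNONYMS = {"election": {"vote"}}
RELATED = {"election": {"campaigning", "ballots", "vote centre"}}

def expand_query(term):
    """Return the search term together with its synonyms and related terms."""
    term = term.lower()
    return {term} | SYNONYMS.get(term, set()) | RELATED.get(term, set())

def search(records, term, semantic=True):
    """Return titles of records whose text contains any of the query terms."""
    terms = expand_query(term) if semantic else {term.lower()}
    return [title for title, text in records.items()
            if any(t in text.lower() for t in terms)]

# Made-up metadata records for the example.
records = {
    "Polling stations 2009": "Position of every vote centre by district",
    "Election districts": "Boundaries used for the national election",
    "Land cover": "Satellite-derived land cover classes",
}

print(search(records, "election", semantic=False))  # keyword match only
print(search(records, "election"))                  # expanded match
```

With the keyword-only search just the record that literally contains "election" is returned; the expanded search also returns the polling station record, because it mentions "vote centre".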
Moreover, the problem of word sense disambiguation may arise, because the same word can have different meanings. For example, the word bark can mean the sound of a dog, the skin of a tree or a three-masted sailing ship. A normal search cannot resolve this.
In the case of ambiguity, semantic search can find the most probable meaning among all possible meanings. The semantic analysis system takes the meanings of the other words in the sentence and matches them against the whole text to arrive at the most probable meaning.
Thesaurus search is search over a structured, controlled vocabulary. Three types of relation are available in a thesaurus: broader term, narrower term and related term. Through thesaurus search people can find related words in the same group. For example, if someone searches for the word apple, then orange, another type of fruit, will also be found through the thesaurus. GeoNetwork does not support thesaurus search.
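A minimal sketch of a thesaurus lookup, assuming the thesaurus is stored as a simple mapping from each term to its broader term. The mapping and function names below are hypothetical and only serve to reproduce the apple and orange example.

```python
# Hypothetical thesaurus: each term points to its broader term.
BROADER = {"apple": "fruit", "orange": "fruit", "fruit": "food"}

def narrower_terms(term):
    """All terms whose broader term is `term`."""
    return sorted(t for t, b in BROADER.items() if b == term)

def sibling_terms(term):
    """Terms in the same group, i.e. sharing the same broader term."""
    parent = BROADER.get(term)
    if parent is None:
        return []
    return [t for t in narrower_terms(parent) if t != term]

def thesaurus_expand(term):
    """Collect the broader, narrower and same-group terms for a search word."""
    return {
        "broader": BROADER.get(term),
        "narrower": narrower_terms(term),
        "siblings": sibling_terms(term),
    }

print(thesaurus_expand("apple"))
```

Searching for apple yields fruit as the broader term and orange as a term in the same group, so a search front end could offer both to the user.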
3. The GeoNetwork Application
GeoNetwork is an open source application that deals mainly with geographical data. It has been developed for collecting, managing and sharing information among different organizations. The software is built on the principles of FOSS (Free and Open Source Software).
3.1 Significant Technologies in GeoNetwork
GeoNetwork is a Java 2 based web application. The technologies it uses are described here.
Java Server Pages
Because GeoNetwork uses Java Server Pages (JSP), it is platform independent, and JSP makes it easy to build the web application.
Because GeoNetwork uses Java Servlets, it requires a servlet-enabled web server such as Jetty or Apache Tomcat. FAO recommends Apache Tomcat as the server. GeoNetwork can also run on a desktop through Jetty, which is bundled with the application. The Jetty server can be started easily by calling a batch procedure.
GeoNetwork uses JDBC (Java DataBase Connectivity) to connect to a DBMS (DataBase Management System). For smaller systems FAO recommends McKoi, which is provided with the application. For larger systems MySQL, Oracle or PostgreSQL is recommended.
XML is used in GeoNetwork for the presentation of metadata, together with XSL style sheets. XML Schema is used for editing metadata.
GeoNetwork provides both local and remote search. For local search it uses the Apache Lucene search engine library. For remote search it uses the Z39.50 protocol, a client-server protocol for searching and retrieving information from remote computer databases.
The GeoNetwork architecture is based on the Jeeves framework. Jeeves (Java Easy Engine for Very Effective System) is a product developed by FAO for presenting databases on the internet. All XML and HTML output of GeoNetwork is produced by Jeeves. Jeeves gives the developer database access, logging management, multilingual support, service chaining, session management and rollback management.
3.2 The GeoNetwork Architecture
GeoNetwork is based on Service Oriented Architecture (SOA) and the Geospatial Portal Reference Architecture. The architecture builds on the Jeeves system described above. The basic Jeeves framework is shown in Figure 4.
The web browser sends a request to Jeeves. Jeeves queries the database if required and, after an XSL transformation, sends the response back to the web browser.
The path from request to response passes through different levels, as shown in Figure 5.
The Jeeves layer connects directly to the web browser by accepting the HTTP request and providing the output response. It is also connected to the GeoNetwork layer and the database layer for resource allocation. The GeoNetwork layer processes the data with style sheets and provides XML data.
3.3 Main Features
Sharing geographical information is the main function of GeoNetwork. Its main features are described in this section.
In GeoNetwork, information is represented as metadata. Metadata means data about data, or information about data. It is a structured set of information stored in the database. GeoNetwork metadata supports the ISO19115, FGDC and Dublin Core standards. Through standardization users can access data effectively and efficiently. Because each standard uses a common terminology, data can be retrieved easily and quickly and displayed in a uniform way. Standardization also ensures consistency and avoids the loss of important information.
There are standard templates for adding metadata, so metadata can be added quickly. The administrator can also create new templates based on existing ones. Because metadata follows standards, XML input files that conform to a metadata standard can also be imported through XSL transformations. Metadata is validated against an XSD schema, and each metadata record is identified by a UUID (Universally Unique Identifier). For ISO19115 metadata, small and large thumbnails can be generated in JPEG, PNG or GIF format.
The administrator or the owner of a metadata record can edit it online, including in XML mode.
Search is the main feature of GeoNetwork. Geodata can be searched locally or on remote computers. Because data are stored in a standard way, results are displayed uniformly and can be downloaded as a document or PDF. If a map or image is available, it can also be downloaded in its original format.
A search in GeoNetwork can be a plain text search or an advanced search based on title, keyword, abstract, place, date range and so on.
Distributed search on metadata is costly in GeoNetwork, because the internet connection is not good everywhere. Data can instead be shared among several GeoNetwork nodes through harvesting, which collects remote metadata and stores it locally for faster access. This process runs periodically, for example once a week.
Metadata is harvested on the basis of its UUID, which is unique for each metadata record in the world; it is a combination of the network interface address, the current date and time, and a random number. When a metadata record is changed or edited locally, the date of the last change is recorded. If the last change date of a harvested record is the same as that of the original record, there is no need to harvest it again. This reduces the cost of harvesting.
Harvested metadata is not editable, because any edits would be lost at the next harvesting run and cannot be written back to the original source, which would cause a synchronization problem. When a harvesting node is removed, its harvested metadata is removed automatically.
Hierarchy is allowed in harvesting. Because harvesting uses the UUID as a unique identifier, a hierarchy of nodes does not lead to duplicates. For example, suppose Node A created metadata a, and Nodes B and C harvested a from Node A. If Node D then harvests a from B, the source of the record is still recorded as Node A.
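The change-date check and the way the original source survives hierarchical harvesting can be sketched as follows. This is a simplified illustration and not GeoNetwork's implementation: the `Record` class, its field names and the `harvest` function are hypothetical, and each node is modeled as a plain dictionary keyed by UUID.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    uuid: str          # globally unique identifier of the metadata record
    change_date: str   # date and time of the last change, as ISO 8601 text
    source: str        # node that originally created the record
    title: str

def harvest(local, remote_records):
    """Copy new or changed records into `local`; return the UUIDs copied."""
    copied = []
    for rec in remote_records:
        existing = local.get(rec.uuid)
        if existing is not None and existing.change_date == rec.change_date:
            continue  # unchanged since the last run, nothing to fetch
        local[rec.uuid] = rec  # the source field travels with the record
        copied.append(rec.uuid)
    return copied

# Node A created record a; B harvests from A, then D harvests from B.
a = Record("uuid-a", "2010-01-01T00:00:00", "A", "Rivers")
node_b, node_d = {}, {}
harvest(node_b, [a])
harvest(node_d, list(node_b.values()))
print(node_d["uuid-a"].source)                 # still the original node
print(harvest(node_d, list(node_b.values())))  # second run copies nothing
```

Because the record is keyed by its UUID and carries its source with it, Node D ends up with a single copy that still names Node A as its origin, and a second harvesting run skips the unchanged record.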
3.4 Backend DataBase
The information is stored in a relational database.
3.4.1 Backend DataBase Connection
GeoNetwork uses McKoi as its default database, but the JDBC connection can be changed to MySQL, Oracle and others through GAST.
GAST can also export and import the database and set it up with sample data. Using this tool, a fresh installation of the application can easily be integrated with an existing database.
3.4.2 DDL (Data Definition Language) of Backend DataBase
The DDL of the GeoNetwork backend database tables is listed here, table by table, with descriptions.
3.4.3 Some Sample Data in Tables
In this section I show some basic tables with sample data, to illustrate what kind of data the database holds.
4. Semantic Matching with S-Match
Matching is one of the most important features in many metadata related applications such as schema and ontology integration, data warehouses, data integration and e-commerce. Developers try to give users more of the information they need, so research on matching keeps increasing. Users are no longer satisfied with plain text search alone; they want more.
4.1 Related work on semantic matching
In the dictionary, the word semantic means "relating to meaning or the study of meaning". Semantic matching is therefore a matching technique that matches words in terms of their meaning. In computer science, semantic matching is a technique for identifying semantically related information.
Semantic matching works on graph-like structures, which may be conceptual hierarchies, database or XML schemas, or ontologies. The output of the matching is a mapping between the semantically related nodes of two graphs. For example, suppose one graph contains the word motorcycle and another contains bike. Semantic matching maps them with an equivalence relation, because the words are synonyms in English. The matching information comes from a linguistic resource such as WordNet.
Different semantic matching systems have been analyzed, such as COMA, Cupid, Rondo and S-Match. For large amounts of data S-Match performs better in terms of run time, as has been shown in detail with practical data. S-Match works on lightweight ontologies with a graph structure, in which the label of each node is written in a natural language such as English. The semantic relation it outputs is equivalence, more general, less general or disjoint. If there is no relation it produces Idk (I don't know).
Semantic matching is widely used in areas such as resource discovery, data integration, data migration, query translation, peer to peer networks, agent communication, and schema and ontology merging. It is used to reduce the heterogeneity problem caused by different terminology, different viewpoints and so on.
4.2 Semantic Matching Algorithm
The S-Match algorithm takes two trees as input and computes the strongest semantic relation between each pair of nodes of the two trees. The possible semantic relations are equivalence (=), more general (⊒), less general (⊑), mismatch (⊥) and overlapping (⊓). Among them equivalence (=) is the strongest and overlapping (⊓) is the weakest. When there is no relation it produces the Idk relation.
Semantic matching is based on two key notions: the concept of a label and the concept at a node.
Concept of a label
It refers to the set of documents that would be classified under a label of the tree.
Concept of a node
It refers to the set of documents that would be classified under a node, which has a certain label and sits in a certain position in the tree.
The basic S-Match algorithm has four steps. The steps are listed here and described in detail later in this section.
Step 1: for all labels L in two trees, compute concepts of labels, CL.
Step 2: for all nodes N in two trees, compute concepts at nodes, CN.
Step 3: for all pairs of labels in two trees, compute relations among CLs.
Step 4: for all pairs of nodes in two trees, compute relations among CNs.
The S-Match algorithm is described briefly using the example of Figure 16. C denotes the concept of a label or a node. Thus in tree A, C_Courses means the concept of the label Courses and C_1 means the concept of node 1.
Step 1: Compute Concepts of Labels
The goal of step 1 is to translate the label of each node from natural language into an internal logical language. It is done through the following steps.
1. Tokenization
The label of each node is passed through a tokenizer, which recognizes punctuation, case, digits, stop characters and so on, and parses the label into tokens. Thus the label Earth and Atmospheric Sciences becomes earth, and, atmospheric and sciences. The list of all multiword concepts is taken from WordNet. The natural language is then formalized with connectives such as and and or to make it meaningful, so the previous example becomes earth sciences and atmospheric sciences. The senses of the concepts are taken from WordNet.
2. Lemmatization
The tokens are then analyzed morphologically to find their possible basic forms; this is known as lemmatization. Thus the token Sciences becomes Science. The aim of lemmatization is to bring words, articles, numbers and so on into a form that can be matched.
3. Building atomic concepts
The senses of the lemmas are looked up in WordNet. For example, the label Sciences is tokenized as Sciences and then lemmatized as Science. WordNet contains a sense of the lemma Science as a noun.
4. Building Complex Concepts
When prepositions, punctuation or conjunctions occur among the tokens, they are translated into logical connectives, producing complex concepts instead of atomic concepts. Some of the translation rules are:
- Commas and conjunctions are translated into logical disjunctions.
- Prepositions such as of and in are translated into logical conjunctions.
- Words like except and without are translated into negations.
The following example shows how the complex concept of a label is computed. The label History and Philosophy of Science is tokenized as History, and, Philosophy, of, Science. Because the tokens include a conjunction and a preposition, a complex concept is built: C_History and Philosophy of Science = (C_History ∨ C_Philosophy) ∧ C_Science, where C_Science, C_History and C_Philosophy each have a sense in WordNet. Note that and in natural language is converted into a disjunction in the logical language rather than a conjunction, because a label such as History and Philosophy classifies documents about either topic.
Thus, through step 1, all natural language labels are converted into propositional formulas of the logical language, in which every word has a sense in WordNet.
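The translation of a label into a propositional formula can be sketched as below. This is a simplified illustration, not the S-Match implementation: the word lists and the lemma table are hypothetical stand-ins for WordNet, a run of adjacent words is treated as a single atomic concept (an assumption made to keep the sketch short), and formulas are built as plain strings.

```python
import re

# Hypothetical word lists and lemma table standing in for WordNet.
DISJUNCTION_TOKENS = {"and", "or", ","}    # translated into disjunction
PREPOSITIONS = {"of", "in", "for", "on"}   # translated into conjunction
NEGATION_WORDS = {"except", "without"}     # translated into negation
LEMMAS = {"sciences": "science", "courses": "course"}

def tokenize(label):
    """Lower-case the label and split it into word tokens and commas."""
    return re.findall(r"[a-z0-9]+|,", label.lower())

def lemmatize(token):
    """Return the basic form of a token (toy lookup)."""
    return LEMMAS.get(token, token)

def split_on(tokens, separators):
    """Split a token list into non-empty groups at separator tokens."""
    groups, current = [], []
    for tok in tokens:
        if tok in separators:
            groups.append(current)
            current = []
        else:
            current.append(tok)
    groups.append(current)
    return [g for g in groups if g]

def atom(tokens):
    """Build one atomic concept from a run of words, negated if required."""
    negated = any(t in NEGATION_WORDS for t in tokens)
    words = [lemmatize(t) for t in tokens if t not in NEGATION_WORDS]
    name = "C_" + "_".join(words)
    return "¬" + name if negated else name

def label_to_formula(label):
    """Prepositions separate conjuncts; 'and', 'or' and commas separate disjuncts."""
    parts = split_on(tokenize(label), PREPOSITIONS)
    conjuncts = []
    for part in parts:
        disjuncts = [atom(g) for g in split_on(part, DISJUNCTION_TOKENS)]
        text = " ∨ ".join(disjuncts)
        if len(disjuncts) > 1 and len(parts) > 1:
            text = f"({text})"
        conjuncts.append(text)
    return " ∧ ".join(conjuncts)

print(label_to_formula("History and Philosophy of Science"))
print(label_to_formula("Except Philosophy"))
```

For the label History and Philosophy of Science the sketch prints (C_history ∨ C_philosophy) ∧ C_science, which has the same shape as the formula derived in the text, and Except Philosophy becomes ¬C_philosophy.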
Step 2: Concepts at Nodes Computation
The conversion from natural language to propositional formulas applies to the concepts at nodes as well as to labels. In S-Match the input tree is a hierarchical classification structure. Each node is uniquely identified by the path from the root to that node, and the meaning of the node is derived from that path. The logical formula for the concept at a node is defined as the conjunction of the concepts of the labels (computed in step 1) on the path from the root node to that node. Thus in tree A, the concept at node 12 is C_12 = C_Courses ∧ C_History ∧ C_Medieval.
The algorithm requires the tree to be consistent, and this step makes it possible to check the consistency of the nodes. Any inconsistency should be removed. For example, the natural language label except_philosophy is translated into the logical formula C_except_philosophy = ¬C_Philosophy according to step 1. If another node in that subtree is labeled C_Philosophy, then according to step 2 the concept at that node will be ... ∧ ¬C_Philosophy ∧ ... ∧ C_Philosophy, which is inconsistent. In this case the user learns that nothing can be classified under that node, and needs to decide which of the two nodes, C_except_philosophy or C_Philosophy, is more important and delete the other from the tree. Importantly, this check does not increase the complexity of the matching algorithm; it works only as a preprocessing step on the tree.
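A small sketch of step 2, assuming the tree is stored as a mapping from each node to its parent and that each label concept is a single atomic formula written as a string. The node names, dictionaries and function names are hypothetical, and the consistency check only detects an atom that occurs together with its own negation on the path.

```python
def path_concepts(parent, label_concept, node):
    """Label concepts on the path from the root down to `node`."""
    concepts = []
    while node is not None:
        concepts.append(label_concept[node])
        node = parent[node]  # the root maps to None
    return concepts[::-1]

def concept_at_node(parent, label_concept, node):
    """Conjunction of all label concepts on the path from the root."""
    return " ∧ ".join(path_concepts(parent, label_concept, node))

def is_consistent(parent, label_concept, node):
    """False when some atom and its negation both occur on the path."""
    concepts = set(path_concepts(parent, label_concept, node))
    return not any("¬" + c in concepts for c in concepts)

# Path Courses -> History -> Medieval, as in the example from tree A.
parent = {"courses": None, "history": "courses", "medieval": "history"}
label_concept = {
    "courses": "C_Courses",
    "history": "C_History",
    "medieval": "C_Medieval",
}
print(concept_at_node(parent, label_concept, "medieval"))

# A subtree rooted at except_philosophy that contains a philosophy node.
parent2 = {"courses": None, "except_philosophy": "courses",
           "philosophy": "except_philosophy"}
label_concept2 = {
    "courses": "C_Courses",
    "except_philosophy": "¬C_Philosophy",
    "philosophy": "C_Philosophy",
}
print(is_consistent(parent2, label_concept2, "philosophy"))
```

The first call prints C_Courses ∧ C_History ∧ C_Medieval, and the second reports the philosophy node as inconsistent because its path contains both ¬C_Philosophy and C_Philosophy.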
Step 3: Label Matching
In step 3 the relations between all pairs of labels are computed. The relations come from a library of element level semantic matchers. These matchers take the atomic concepts of two labels produced in step 1 as input and return the semantic relation between them as output. Other well known matching systems such as Cupid and COMA use this technique as well; the difference is that S-Match returns a semantic relation, while the others return a value in the range 0 to 1.
The element level matchers implemented so far are summarized in Table 2. The first column contains the matcher name, the second the execution order and the third the approximation level. Matchers with the first approximation level are always correct; the higher the level, the lower the correctness. The fourth and fifth columns give the matcher type and the input used for matching, respectively.
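Matcher name                        Execution order  Approximation level  Matcher type  Schema info
Prefix                              2                2                    String-based  Labels
Suffix                              3                2                    String-based  Labels
Edit distance                       4                2                    String-based  Labels
N-gram                              5                2                    String-based  Labels
Text Corpus                         13               3                    String-based  Labels + Corpus
WordNet                             1                1                    Sense-based   WordNet senses
Hierarchy distance                  6                3                    Sense-based   WordNet senses
WordNet Gloss                       7                3                    Gloss-based   WordNet senses
Extended WordNet Gloss              8                3                    Gloss-based   WordNet senses
Gloss Comparison                    9                3                    Gloss-based   WordNet senses
Extended Gloss Comparison           10               3                    Gloss-based   WordNet senses
Semantic Gloss Comparison           11               3                    Gloss-based   WordNet senses
Extended semantic gloss comparison  12               3                    Gloss-based   WordNet senses
Table 2: List of element level semantic matchers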
There are three types of matchers according to the matching criterion: string-based, sense-based and gloss-based. They are discussed in detail below.
1. String Based Matchers
Most matching techniques are based on string matching. String-based matchers produce an equivalence relation between the input labels; if there is no relation they return Idk. The string-based matchers are described here.
The Prefix matcher checks the beginning of the labels. If one label starts with the other it returns the equivalence relation, otherwise Idk. Some examples of the Prefix matcher are given in Table 3.
Source Label  Target Label  Relation
Net           Network       =
Hot           Hotel         =
Cat           Core          Idk
Table 3: Example of Prefix matcher
The Prefix matcher does not always give the correct relation. For example, Hot and Hotel produce an equivalence relation, which is not correct. The other two examples give the right output.
The Suffix matcher checks the ending of the labels. If one label ends with the other it returns the equivalence relation, otherwise Idk. Some examples of the Suffix matcher are given in Table 4.
Source Label  Target Label  Relation
Phone         Telephone     =
Word          Sword         =
Floor         Door          Idk
Table 4: Example of Suffix matcher
Like the Prefix matcher, the Suffix matcher is not always correct. It gives a wrong relation between the labels Word and Sword.
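The Prefix and Suffix matchers are simple enough to sketch directly. The function names are hypothetical and the comparison is made case-insensitive as an assumption; the relation is returned as the string "=" or "Idk".

```python
def prefix_matcher(source, target):
    """Equivalence if one label starts with the other, otherwise Idk."""
    a, b = source.lower(), target.lower()
    return "=" if a.startswith(b) or b.startswith(a) else "Idk"

def suffix_matcher(source, target):
    """Equivalence if one label ends with the other, otherwise Idk."""
    a, b = source.lower(), target.lower()
    return "=" if a.endswith(b) or b.endswith(a) else "Idk"

for pair in [("Net", "Network"), ("Hot", "Hotel"), ("Cat", "Core")]:
    print(pair, prefix_matcher(*pair))
for pair in [("Phone", "Telephone"), ("Word", "Sword"), ("Floor", "Door")]:
    print(pair, suffix_matcher(*pair))
```

The sketch reproduces the relations listed for both matchers, including the false positives for Hot and Hotel and for Word and Sword, which shows why these matchers sit at approximation level 2.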
The Edit Distance matcher first counts the number of edit operations needed to convert one label into the other and divides this count by the length of the longer label. Subtracting the result from 1 gives a similarity value in the range 0 to 1. If the value is greater than a given threshold (.6 by default) it returns the equivalence relation, otherwise Idk. Examples are given in Table 5.
Source Label  Target Label  Relation
Street        Street1       =
Proper        Propel        =
Owe           Woe           Idk
Table 5: Example of Edit Distance matcher
This matcher can also give wrong matches. In the example, Street and Street1 have a value of .86 and the match is correct; Proper and Propel have a value of .83, but they are not equivalent.
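The computation can be sketched with the standard Levenshtein distance. The sketch assumes that the reported value is one minus the distance divided by the length of the longer label, which is consistent with the values .86 and .83 quoted above; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def edit_distance_similarity(source, target):
    """1 - distance / length of the longer label (1.0 for two empty labels)."""
    a, b = source.lower(), target.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest

def edit_distance_matcher(source, target, threshold=0.6):
    """Equivalence if the similarity exceeds the threshold, otherwise Idk."""
    similar = edit_distance_similarity(source, target) > threshold
    return "=" if similar else "Idk"

for pair in [("Street", "Street1"), ("Proper", "Propel"), ("Owe", "Woe")]:
    print(pair, round(edit_distance_similarity(*pair), 2),
          edit_distance_matcher(*pair))
```

Street and Street1 differ by one insertion over seven characters, Proper and Propel by one substitution over six, and Owe and Woe by two substitutions over three, which falls below the default threshold.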
The NGram matcher counts the n-grams (sequences of n characters) that the two input labels have in common. For example, the trigrams of street are str, tre, ree and eet. If the resulting similarity value is greater than a given threshold (.6) it returns equivalence, otherwise Idk. The results are summarized in Table 6.
Source Label  Target Label  Relation
Address       Address1      =
Behavior      Behaviour     =
Door          Floor         Idk
Table 6: Example of NGram matcher
The NGram matcher does not always produce the right output either.
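A minimal sketch of the NGram matcher follows. The text does not say how the count of shared trigrams is normalized, so the sketch assumes the similarity is the number of shared trigrams divided by the larger of the two trigram counts; this assumption and the function names are not taken from S-Match.

```python
def ngrams(label: str, n: int = 3) -> set:
    """Set of all sequences of n consecutive characters in the label."""
    text = label.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(source: str, target: str, n: int = 3) -> float:
    grams_a, grams_b = ngrams(source, n), ngrams(target, n)
    if not grams_a or not grams_b:
        return 0.0  # a label shorter than n has no n-grams
    # Assumed normalization: shared n-grams over the larger n-gram set.
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


def ngram_matcher(source: str, target: str, threshold: float = 0.6) -> str:
    return "=" if ngram_similarity(source, target) > threshold else "Idk"


print(sorted(ngrams("street")))
for pair in [("Address", "Address1"), ("Behavior", "Behaviour"), ("Door", "Floor")]:
    print(pair, ngram_matcher(*pair))
```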
2. Sense Based Matchers
Sense-based matchers rely on WordNet, a lexical database containing a large collection of English terms, which is also available online. WordNet stores the senses of words together with their synonyms. Each sense has a gloss that defines the concept behind the word. For example, the words night, nighttime and dark share a single sense whose gloss is "the time after sunset and before sunrise while it is dark". Senses are connected by hypernym, hyponym, synonym or antonym relations. The output semantic relations are derived from the WordNet relations by the following rules:
- A ⊑ B (A is less general than B) if A is a hyponym, meronym or troponym of B;
- A ⊒ B (A is more general than B) if A is a hypernym or holonym of B;
- A = B if they are connected by the synonymy relation (for example, night and nighttime);
- A ⊥ B (A and B are disjoint) if they are connected by the antonymy relation or they are siblings in the part-of hierarchy.
For example, since a tree is a kind of plant, tree is a hyponym of plant and plant is a hypernym of tree. Analogously, since a trunk is a part of a tree, trunk is a meronym of tree and tree is a holonym of trunk. The sense-based matchers are described here.
The WordNet matcher is an element-level semantic matcher. It returns equivalence, more general or less general according to the relation between the two input senses in WordNet; otherwise it returns Idk. Examples of the WordNet matcher are given in Table 7.
Source Label | Target Label | Relation
Car | Minivan | ⊒
Car | Auto | =
Tail | Dog | ⊑
Red | Pink | Idk
Table 7: Example of WordNet matcher
The accuracy of the result depends on the content of WordNet: the richer its coverage, the more accurate the result. Developers can extend WordNet for their specific domain.
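The way WordNet relations are translated into semantic relations can be sketched as follows. The sketch does not query WordNet; it uses a tiny hand-written table (TOY_WORDNET, a hypothetical stand-in) that contains only the relations needed for the examples of Table 7. The symbols ⊑, ⊒ and ⊥ stand for less general, more general and disjoint.

```python
# Hypothetical stand-in for WordNet: (word, other word) -> lexical relation.
TOY_WORDNET = {
    ("minivan", "car"): "hyponym",   # a minivan is a kind of car
    ("car", "auto"): "synonym",
    ("tail", "dog"): "meronym",      # a tail is a part of a dog
}

# How a stored relation reads when the pair is looked up in reverse order.
INVERSE = {
    "hyponym": "hypernym", "hypernym": "hyponym",
    "meronym": "holonym", "holonym": "meronym",
    "synonym": "synonym", "antonym": "antonym",
}

LESS_GENERAL = {"hyponym", "meronym", "troponym"}
MORE_GENERAL = {"hypernym", "holonym"}


def lexical_relation(source: str, target: str):
    a, b = source.lower(), target.lower()
    if (a, b) in TOY_WORDNET:
        return TOY_WORDNET[(a, b)]
    if (b, a) in TOY_WORDNET:
        return INVERSE.get(TOY_WORDNET[(b, a)])
    return None


def wordnet_matcher(source: str, target: str) -> str:
    relation = lexical_relation(source, target)
    if relation in LESS_GENERAL:
        return "⊑"
    if relation in MORE_GENERAL:
        return "⊒"
    if relation == "synonym":
        return "="
    if relation == "antonym":
        return "⊥"
    return "Idk"


for pair in [("Car", "Minivan"), ("Car", "Auto"), ("Tail", "Dog"), ("Red", "Pink")]:
    print(pair, wordnet_matcher(*pair))
```

Because Red and Pink are missing from the toy table, the matcher answers Idk for them, which mirrors the dependence on the content of WordNet noted above.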
The Hierarchy matcher computes the number of levels that separate two concepts, provided they belong to the same hierarchy in WordNet. If the result is less than a given threshold, it returns equivalence; otherwise it returns Idk. An example is given in Table 8.
Source Label | Target Label | Relation
Red | Pink | =
Catalog | Classification | Idk
Table 8: Example of Hierarchy matcher
The Hierarchy matcher is a good approach to matching, but its main problem is that it depends on the internal structure of WordNet. Sometimes two words with the same sense belong to different hierarchy trees, in which case it produces Idk.
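The idea of the Hierarchy matcher can be illustrated with a toy hierarchy. The parent links below (TOY_PARENT) are invented for the example and are not copied from WordNet, and the threshold of 3 is likewise an assumption. The distance is the number of steps from both labels to their closest common ancestor.

```python
# Hypothetical child -> parent links; not the real WordNet hierarchy.
TOY_PARENT = {
    "red": "chromatic color",
    "pink": "chromatic color",
    "chromatic color": "color",
    "catalog": "list",
    "classification": "grouping",
}


def ancestors(word: str) -> dict:
    """Map the word and each of its ancestors to the number of steps up."""
    distances = {}
    steps = 0
    while word is not None and word not in distances:
        distances[word] = steps
        word = TOY_PARENT.get(word)
        steps += 1
    return distances


def hierarchy_distance(source: str, target: str):
    up_a, up_b = ancestors(source.lower()), ancestors(target.lower())
    common = set(up_a) & set(up_b)
    if not common:
        return None  # the two labels are in different hierarchy trees
    return min(up_a[node] + up_b[node] for node in common)


def hierarchy_matcher(source: str, target: str, threshold: int = 3) -> str:
    distance = hierarchy_distance(source, target)
    return "=" if distance is not None and distance < threshold else "Idk"


print(hierarchy_matcher("Red", "Pink"))
print(hierarchy_matcher("Catalog", "Classification"))
```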
3. Gloss Based Matchers
Gloss-based matchers take two WordNet senses as input and return the semantic relation between them, using their glosses. A WordNet gloss is an explanation of a word in natural language. The matchers that are based on gloss matching are described here.
In the WordNet gloss matcher, one input label is checked against the WordNet gloss of the other. First the sense of each input is looked up in WordNet. Then the matcher counts how many times the second label occurs in the gloss of the first. If the count exceeds a given threshold, the first input is reported as less general than the second; if the word does not occur, or does not occur often enough, Idk is produced. The reasoning is that a word is usually explained by specializing a more general concept, so a label whose gloss mentions another label is taken to be less general than it. Some examples are listed in Table 9.
Source Label | Target Label | Relation
Hound | Dog | ⊑
Hound | Ear | ⊑
Dog | Cat | Idk
Table 9: Example of WordNet gloss matcher
The gloss of hound is "any of several breeds of dog used for hunting typically having large drooping ears". The matcher therefore reports hound as less general than both dog and ear, because both words occur in the gloss. A hound is indeed a dog with special properties, so the first relation is correct, but the second is not.
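The behaviour on the hound example can be reproduced with a short sketch. It assumes a threshold of 0 (one occurrence is enough) and a very crude plural rule, so that ears counts as an occurrence of ear; the gloss of hound is the one quoted above, while the gloss of dog is placeholder text and not the actual WordNet gloss.

```python
import re

GLOSSES = {
    "hound": "any of several breeds of dog used for hunting "
             "typically having large drooping ears",
    # Placeholder text for illustration only, not the real WordNet gloss.
    "dog": "a domesticated animal often kept as a pet",
}


def gloss_matcher(source: str, target: str, threshold: int = 0) -> str:
    """Report source as less general than target if target occurs in its gloss."""
    gloss = GLOSSES.get(source.lower())
    if gloss is None:
        return "Idk"
    words = re.findall(r"[a-z]+", gloss.lower())
    wanted = target.lower()
    # Crude plural handling (assumption): "ears" counts as "ear".
    count = sum(1 for word in words if word in (wanted, wanted + "s"))
    return "⊑" if count > threshold else "Idk"


for pair in [("hound", "Dog"), ("hound", "Ear"), ("dog", "Cat")]:
    print(pair, gloss_matcher(*pair))
```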
WordNet Extended Gloss
The WordNet extended gloss matcher compares the first input with the extended gloss of the second input rather than with its plain gloss. The extended gloss is built from the glosses of the ancestors or descendants of the input in the WordNet hierarchy. A threshold limits the maximum depth of the hierarchy that is considered; by default only direct descendants or ancestors are used.
If the number of occurrences of the first input label in the extended gloss exceeds the threshold, a semantic relation is returned; otherwise Idk is produced. The type of the relation depends on whether the occurrences come from the lower or the upper levels of the hierarchy: if the label is found in the glosses of descendants, the less general relation is returned, and if it is found in the glosses of ancestors, the more general relation is returned. Examples are listed in Table 10.
Source Label | Target Label | Relation
Dog | Breed | ⊑
Dog | Cat | Idk
Wheel | Machinery | ⊑
Table 10: Example of WordNet extended gloss matcher
The Gloss Comparison matcher takes the glosses of the two given labels and counts the words they have in common. If the count exceeds a given threshold, the equivalence relation is returned; otherwise Idk. Some examples are given in Table 11.
Source Label | Target Label | Relation
Afghan hound | Maltese dog | =
Dog | Cat | Idk
Table 11: Example of Gloss Comparison matcher
The gloss of Maltese dog is "a breed of toy dogs having a long straight silky white coat" and the gloss of Afghan hound is "a tall graceful breed of hound with a long silky coat; native to the Near East". The two glosses have 4 words in common (breed, long, silky, coat), so the two labels can be taken as equivalent.
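The count of common words can be reproduced with the sketch below. Two things are assumptions: function words such as a, of and with are ignored (otherwise they would also be counted as common words), and the threshold is set to 3 so that four shared words are enough for equivalence.

```python
import re

# Assumed list of function words that carry no meaning for the comparison.
STOP_WORDS = {"a", "an", "the", "of", "to", "with", "is", "having"}


def content_words(gloss: str) -> set:
    return set(re.findall(r"[a-z]+", gloss.lower())) - STOP_WORDS


def common_words(gloss_a: str, gloss_b: str) -> list:
    return sorted(content_words(gloss_a) & content_words(gloss_b))


def gloss_comparison_matcher(gloss_a: str, gloss_b: str, threshold: int = 3) -> str:
    return "=" if len(common_words(gloss_a, gloss_b)) > threshold else "Idk"


maltese_dog = "a breed of toy dogs having a long straight silky white coat"
afghan_hound = ("a tall graceful breed of hound with a long silky coat; "
                "native to the Near East")

print(common_words(afghan_hound, maltese_dog))
print(gloss_comparison_matcher(afghan_hound, maltese_dog))
```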
Extended Gloss Comparison
The Extended Gloss Comparison matcher counts the words shared by the two extended glosses of the input labels. If the extended gloss of the first label shares more words than a given threshold with the extended gloss built from the descendants of the second, the first label is more general than the second, and vice versa. If there are many common words for both ancestors and descendants, it returns the equivalence relation. Some examples are listed in Table 12.
Source Label | Target Label | Relation
Dog | Cat | =
House | Animal | Idk
Table 12: Example of Extended Gloss Comparison matcher
The label matching algorithm produces a matrix that contains the semantic relation for every pair of atomic concepts of labels produced in step 1. Each relation is determined by the element-level semantic matchers listed in Table 2, according to the configuration of the algorithm, which also fixes the order in which the matchers are executed. Part of the relation matrix for the trees of Figure 16 is listed in Table 13.
A \ B | ClassesB | HistoryB | ModernB | EuropeB
CoursesA | = | Idk | Idk | Idk
HistoryA | Idk | = | Idk | Idk
MedievalA | Idk | Idk | ⊥ | Idk
AsiaA | Idk | Idk | Idk | ⊥
Table 13: Relation Matrix of Level of Concepts
For example, Courses and Classes are synonyms in WordNet, so they are in the equivalence relation. Europe and Asia are siblings, so they are in the disjoint relation.
Step 4: Node Matching
In this step, the relations are computed for all pairs of nodes between the two trees. First the tree matching algorithm is initialized; it then invokes the node matching algorithm for every pair of nodes. Both algorithms are described here.
1. Tree Matching Algorithm
Tree matching takes as input the two preprocessed trees obtained from steps 1 and 2 and the relations between atomic concepts of labels from step 3. Its output is a matrix that contains the semantic relation between each pair of nodes of the two trees. The pseudo code of the tree matching algorithm is given in Figure 17.
1. String[][] TreeMatch(Tree of Nodes source, target, String[][] cLabsMatrix)
2.   Node sourceNode, targetNode;
3.   String[][] cNodesMatrix, relMatrix;
4.   String axioms, contextA, contextB;
5.   int i, j;
6.   For each sourceNode in source
7.     i = getNodeId(sourceNode);
8.     contextA = getCnodeFormula(sourceNode);
9.     For each targetNode in target
10.      j = getNodeId(targetNode);
11.      contextB = getCnodeFormula(targetNode);
12.      relMatrix = extractRelMatrix(cLabsMatrix, sourceNode, targetNode);
13.      axioms = mkAxioms(relMatrix);
14.      cNodesMatrix[i][j] = nodeMatch(axioms, contextA, contextB);
15.  return cNodesMatrix;
The algorithm takes as input the two trees (source and target) and the matrix containing the relations between atomic concepts of labels. Two nested loops cover all pairs of nodes. getCnodeFormula returns the concept at a node, expressed over the atomic concepts of labels; the concepts of the source node and the target node are stored in contextA and contextB in lines 8 and 11 respectively. The relations between the concepts of labels that occur in the two nodes are extracted into relMatrix by the extractRelMatrix function in line 12. These relations are taken as premises and stored as axioms in line 13. Finally, the semantic relation between the two nodes is computed by the NodeMatch algorithm in line 14. The details of that algorithm are described in the next section.
For example, for the pair of nodes C4 in A and C4 in B, the relations between the relevant concepts of labels are ClassesB = CoursesA and HistoryB = HistoryA. So the axioms will be (ClassesB ↔ CoursesA) ∧ (HistoryB ↔ HistoryA).
2. Node Matching Algorithm
The node matching algorithm takes as input the concepts of two nodes and the axioms obtained in the tree matching algorithm, and it returns the relation between that pair of nodes.
1. String nodeMatch(String axioms, contextA, contextB)
2.   String formula = And(axioms, contextA, contextB);
3.   String formulaInCNF = convertToCNF(formula);
4.   boolean isOpposite = isUnsatisfiable(formulaInCNF);
5.   if (isOpposite)
6.     return ⊥;
7.   formula = And(axioms, contextA, Not(contextB));
8.   formulaInCNF = convertToCNF(formula);
9.   boolean isLG = isUnsatisfiable(formulaInCNF);
10.  formula = And(axioms, Not(contextA), contextB);
11.  formulaInCNF = convertToCNF(formula);
12.  boolean isMG = isUnsatisfiable(formulaInCNF);
13.  if (isMG && isLG)
14.    return =;
15.  if (isLG)
16.    return ⊑;
17.  if (isMG)
18.    return ⊒;
19.  return Idk;
In line 2 the algorithm constructs the formula for testing disjointness. The formula is converted to CNF by convertToCNF in line 3. The formulas used for each relation between the concepts are listed in Table 14. In line 4 the CNF formula is checked for unsatisfiability; if it is unsatisfiable, the disjoint relation is returned. The algorithm then tests the less general and more general relations in the same way. If both hold, it returns the equivalence relation; otherwise it returns whichever one holds. If no relation is satisfied, it returns Idk.
rel(a, b) | Translation of rel(a, b) into propositional logic | CNF (Conjunctive Normal Form) formula
a = b | a ↔ b | N/A
a ⊑ b | a → b | axioms ∧ contextA ∧ ¬contextB
a ⊒ b | b → a | axioms ∧ contextB ∧ ¬contextA
a ⊥ b | ¬(a ∧ b) | axioms ∧ contextA ∧ contextB
Table 14: CNF Formula for the relations of concepts of labels
Thus, to test whether C4 in tree A is less general than C4 in tree B, the formula is ((ClassesB ↔ CoursesA) ∧ (HistoryB ↔ HistoryA)) ∧ ¬(ClassesB ∧ HistoryB) ∧ (CoursesA ∧ HistoryA), which is unsatisfiable. For the more general relation the formula is ((ClassesB ↔ CoursesA) ∧ (HistoryB ↔ HistoryA)) ∧ (ClassesB ∧ HistoryB) ∧ ¬(CoursesA ∧ HistoryA), which is also unsatisfiable.
So, finally, the equivalence relation is returned. Part of the relations between the concepts of nodes is shown in Table 15.
A \ B | C1B | C4B | C14B | C17B
C1A | = | ⊒ | ⊒ | ⊒
C4A | ⊑ | = | ⊒ | ⊒
C12A | ⊑ | ⊑ | ⊥ | ⊥
C16A | ⊑ | ⊑ | ⊥ | ⊥
Table 15: Semantic relation between concepts of nodes
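The decision procedure of the node matching algorithm can be tried out with a small sketch. It is a simplification made for illustration: it skips the conversion to CNF and, in place of a SAT solver, tests unsatisfiability by enumerating all truth assignments, which is only feasible for a handful of variables. Formulas are written as Python functions over an assignment, and all names are hypothetical. The symbols ⊥, ⊑ and ⊒ stand for disjoint, less general and more general.

```python
from itertools import product


def is_unsatisfiable(formula, variables) -> bool:
    """True if no truth assignment of the variables satisfies the formula."""
    for values in product([False, True], repeat=len(variables)):
        if formula(dict(zip(variables, values))):
            return False
    return True


def node_match(axioms, context_a, context_b, variables) -> str:
    def unsat(*parts) -> bool:
        # The conjunction of all parts, checked by brute force.
        return is_unsatisfiable(lambda v: all(p(v) for p in parts), variables)

    def not_a(v):
        return not context_a(v)

    def not_b(v):
        return not context_b(v)

    if unsat(axioms, context_a, context_b):
        return "⊥"                                      # disjoint
    is_less_general = unsat(axioms, context_a, not_b)    # A implies B
    is_more_general = unsat(axioms, not_a, context_b)    # B implies A
    if is_less_general and is_more_general:
        return "="
    if is_less_general:
        return "⊑"
    if is_more_general:
        return "⊒"
    return "Idk"


# Nodes C4 of tree A and C4 of tree B from the running example.
variables = ["CoursesA", "HistoryA", "ClassesB", "HistoryB"]


def axioms(v):
    return (v["ClassesB"] == v["CoursesA"]) and (v["HistoryB"] == v["HistoryA"])


def context_a(v):
    return v["CoursesA"] and v["HistoryA"]


def context_b(v):
    return v["ClassesB"] and v["HistoryB"]


print(node_match(axioms, context_a, context_b, variables))
```

For the running example both implication tests are unsatisfiable, so the sketch returns the equivalence relation, in agreement with the text.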
5. Integration of Semantic matching in GeoNetwork
The search quality of an application can be improved by giving the user more relevant information. With ordinary keyword search, people can find only exact matches, which no longer satisfies user requirements.
The semantic heterogeneity problem is a popular research topic nowadays. Moreover, geographic information systems need to share information, so semantic matching needs to be integrated into them to make sharing easier.
In section 1 I briefly discussed the semantic heterogeneity problem in geographic information systems. The current search method and its problems in the application domain are described in section 2. GeoNetwork and Semantic Matching are described in detail in sections 3 and 4 respectively.
Because GeoNetwork normally uses the Lucene search method, it cannot find semantic relations. Moreover, thesaurus search is not possible in the current GeoNetwork application. Integrating Semantic Matching can solve these problems: with S-Match people can retrieve more related information, and thesaurus search can also be provided through S-Match. In this section I discuss the architecture for integrating semantic matching into GeoNetwork.
An architecture for integrating S-Match into GeoNetwork has already been designed; it is shown in Figure 19. I now describe its steps in detail with an example. Suppose the search query is "forests and mountains in Trento".
The whole system can be divided into three parts. The first part interacts directly with the user; it is shown in the BEA ALUI 6.1 frame. It takes input from the user and returns the search result, i.e. it is the interface part of GeoNetwork that is visible to the user. The second part analyzes the request and matches it with WordNet; S-Match is implemented here. It is shown in the SWeb P3 frame and forms the middleware layer. The last part is the backend of GeoNetwork. In this part the search result is produced from the geo data repository using the search catalog produced in the semantic web layer.
Semantic Search Client
Currently GeoNetwork supports only plain text search through the Lucene searcher. To plug semantic search into GeoNetwork, a new search client must first be established in the GeoNetwork application. The semantic search client handles the semantic part of the search.
GeoNetwork supports three types of search: what, where and when. For a semantic search, the application first sends the search query to the semantic search client. This is done in the search method of GeoNetwork.
The semantic search client takes the input query from the user and sends a request to the semantic web layer for analysis.
Suppose a user wants metadata related to forests and mountains in Trento from GeoNetwork. He enters "forests and mountains" in the what input box and "Trento" in the where input box of GeoNetwork. The semantic search client then sends the request to the query analyzer, with "forests and mountains" as the input query and Trento as the location.
The query analyzer takes the user's request from the semantic search client. It analyzes the query and makes it usable by S-Match. Because S-Match finds relations between two tree-structured inputs, the query analyzer turns the query into a tree structure.
The query analyzer parses only the query part of the user's input, not the location. In the example the result is a tree with only one node, labelled "forests and mountains".
Semantic matching retrieves synonym, disjoint, more general and less general information. To find these relations it needs general knowledge about the meaning of words, so S-Match uses WordNet as a dictionary.
WordNet is an English lexical dictionary in which the words are organized in a tree structure. Every word in WordNet has a definition, which is known as the sense of the word.
Run Time Matching
S-Match is applied in this phase. First the input is tokenized and then lemmatized. These two operations are the preprocessing for matching and are performed by the preprocess method, which is the first step of S-Match. The lemmas are then matched through the WordNet matcher to create a catalog of their synonyms and of more general and less general words. This is done by element-level matching in S-Match. In the meantime, the catalog for other languages is worked out with Agrovoc.
In the running example, the input query is tokenized as forests, and, mountains. Then forests and mountains are lemmatized to forest and mountain respectively. The lemmas (forest, mountain) are matched through WordNet, which returns a search catalog of synonyms and of more general and less general words for each lemma. The more general and less general words are known as related terms in semantic matching. In the running example, the matcher returns wood, woodland, timber and timberland as synonyms of forest, and tree, plant, shrubs, wooded area and afforest as its related terms. Likewise, it returns mount as a synonym of mountain, and hill and peak as its related terms.
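The run time matching steps for the running example can be sketched as follows. This is a toy illustration and does not call S-Match or WordNet: the lemmatizer only strips a plural s, the stop word list is an assumption, and the lookup table TOY_CATALOG simply holds the terms quoted in the running example.

```python
import re

STOP_WORDS = {"and", "or", "in", "the", "of"}  # assumed stop word list

# Terms copied from the running example; a stand-in for the WordNet matcher.
TOY_CATALOG = {
    "forest": {
        "synonyms": ["wood", "woodland", "timber", "timberland"],
        "related": ["tree", "plant", "shrubs", "wooded area", "afforest"],
    },
    "mountain": {
        "synonyms": ["mount"],
        "related": ["hill", "peak"],
    },
}


def tokenize(query: str) -> list:
    return re.findall(r"[a-z]+", query.lower())


def lemmatize(token: str) -> str:
    # Crude rule for the sketch: drop a plural "s" from longer words.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def build_search_catalog(query: str) -> dict:
    catalog = {}
    for token in tokenize(query):
        if token in STOP_WORDS:
            continue
        lemma = lemmatize(token)
        catalog[lemma] = TOY_CATALOG.get(lemma, {"synonyms": [], "related": []})
    return catalog


for lemma, terms in build_search_catalog("forests and mountains").items():
    print(lemma, terms)
```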
An ontology is a formal specification of the concepts of a domain; it describes information about sources and makes it interpretable by machines. In this case Agrovoc is the user-defined ontological domain.
Agrovoc is a multilingual structured thesaurus covering all subject fields in agriculture, forestry, fisheries, food security and related domains (e.g. sustainable development, nutrition). Agrovoc contains the information that is associated with the metadata of the application.
Design time matching
GeoNetwork supports multiple languages, but WordNet covers only English. WordNet therefore needs to be matched with a user-defined multilingual ontological domain, which here is Agrovoc. A mapping is designed between WordNet and Agrovoc, so that for any English word in WordNet its corresponding word in other languages can easily be found through Agrovoc. This mapping is used to update the search catalog for the other languages.
The search catalog step takes the catalog produced by run time matching together with the updated catalog from Agrovoc, and creates the search data with the help of the metadata catalog.
The search data step produces the output result from the geo data repository. This is done by the search method in GeoNetwork, and the search procedure is based on the search catalog. The search result is then sent to the user. The result can be shown as metadata or in the interactive map viewer, both of which GeoNetwork currently supports. Hence, users can also retrieve information from GeoNetwork through the semantic search client.
GeoNetwork will search for all terms returned by S-Match. Moreover, the search can be restricted to a specific location or time period according to the user's requirements. In the running example, only the metadata of Trento is searched.
5.2 Implementation Details
I now describe in detail how to integrate S-Match into GeoNetwork. Currently GeoNetwork uses Lucene as its search technique and the Z39.50 protocol for remote searching. Two search clients are available in GeoNetwork to perform these searches. They are implemented in the LuceneSearcher and Z3950Searcher classes, both of which extend the MetaSearcher class.
To plug a new search technique into an application, we first need to create a search client, which holds the search logic. So, to integrate S-Match into GeoNetwork, we need to create a search client, implemented in the proposed SMatchSearcher class. This class will also extend the MetaSearcher class.
The first step of the integration is to initialize the SMatchSearcher in the exec method of the Search class of GeoNetwork, like the other searchers. The pseudo code is given in Figure 20. Here, a searcher for S-Match is created in line 6 and the input query is sent to the searcher in line 11.
1. Element exec (string query)
3.   SearchManager searchMan;
4.   MetaSearcher searcher;
5.   If (smatch)
6.     searcher = searchMan.newSearcher(SearchManager.SMATCH);
7.   Else if (remote)
8.     searcher = searchMan.newSearcher(SearchManager.Z3950);
9.   Else
10.    searcher = searchMan.newSearcher(SearchManager.LUCENE);
11.  searcher.search(query);
Next, add and initialize the SMATCH search type in the newSearcher method of the SearchManager class. The pseudo code is given in Figure 21. The SMatchSearcher is instantiated in line 7.
1. newSearcher (int type)
3. switch (type)
5. case LUCENE: return new LuceneSearcher();
6. case Z3950: return new Z3950Searcher();
7. case SMATCH: return new SMatchSearcher();
The pseudo code of the search method of the SMatchSearcher class is given in Figure 22. The search method first calls the query analyzer in line 3, which analyzes the query and converts it into a tree structure. It then calls the preprocess method of S-Match to tokenize and lemmatize the labels of the query tree. Finally it calls the TreeMatch method of S-Match, which matches the query tree against the WordNet tree. The pseudo code of the tree matching algorithm is given in Figure 17.
1. Void Search (query)
3. IContext treeQuery = queryAnalyzer (query);
4. IContext preProcessedQuery = preprocess (treeQuery);
5. IMatchMatrix CNodMatrix = treeMatch (preProcessedQuery, wordnetContext);
The TreeMatch method of S-Match returns a matrix that holds the relation between each pair of nodes of the two trees. When the input query of GeoNetwork is matched against the WordNet tree, it returns a large number of relations. Most of them will be Idk (I don't know), because the input query covers only a small part of the WordNet dictionary. GeoNetwork, however, expects only the synonyms and related terms of the input query, so S-Match will return only those terms that have a synonym, more general or less general relation with the input query.
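The filtering step described above can be sketched as a small function. The input format (triples of query term, WordNet term and relation) and the function name are assumptions made for this illustration, and the relations in the sample data are invented; the real result of TreeMatch is a matrix object of S-Match. The symbols ⊑, ⊒ and ⊥ stand for less general, more general and disjoint.

```python
# Which relations GeoNetwork is interested in, and how they are reported.
RELATION_KIND = {"=": "synonyms", "⊑": "related", "⊒": "related"}


def extract_search_terms(matches) -> dict:
    """Keep synonyms and related terms; drop Idk and disjoint entries."""
    catalog = {"synonyms": [], "related": []}
    for _query_term, wordnet_term, relation in matches:
        kind = RELATION_KIND.get(relation)
        if kind is not None:
            catalog[kind].append(wordnet_term)
    return catalog


# Invented sample relations, for illustration only.
sample_matches = [
    ("forest", "woodland", "="),
    ("forest", "wooded area", "⊑"),
    ("forest", "river", "Idk"),
    ("forest", "desert", "⊥"),
]
print(extract_search_terms(sample_matches))
```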
5.3 S-Match as Web Service
A web service is a way to use services and functions that are distributed around the world without replicating them locally. Publishing S-Match as a web service would therefore be a good way to offer this search technique to the world. Moreover, when the service provider upgrades or updates the service, users automatically benefit from the up-to-date search technique.
S-Match uses WordNet as its linguistic dictionary. The WordNet database is updated regularly, so if another web service is used to interact with WordNet, the dictionary is always up to date.
However, any application that uses a web service requires an Internet connection, and GeoNetwork is often deployed in areas, for example in Africa and Asia, where Internet connectivity is limited. For this reason GeoNetwork provides a harvesting mechanism for accessing remote databases (as described in the GeoNetwork manual). If S-Match is published only as a web service, it cannot be used when no Internet connection is available. Providing both the S-Match API and the S-Match web service in GeoNetwork could be a solution that gives the best service.
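The combination of web service and local API can be sketched as a simple fallback. The two matcher functions passed in are hypothetical placeholders for a call to the S-Match web service and a call to a locally installed S-Match API; the sketch only assumes that a failed network call raises OSError (of which ConnectionError is a subclass in Python).

```python
def match_with_fallback(query_tree, remote_match, local_match):
    """Try the web service first; fall back to the local API when offline."""
    try:
        return remote_match(query_tree)
    except OSError:
        # No connectivity (or the service is unreachable): use the local copy.
        return local_match(query_tree)


# Hypothetical stand-ins used only to demonstrate the control flow.
def offline_service(query_tree):
    raise ConnectionError("no Internet connection")


def local_api(query_tree):
    return {"source": "local", "query": query_tree}


print(match_with_fallback("forests and mountains", offline_service, local_api))
```

The design choice here is to prefer the remote service, which is kept up to date by its provider, and to use the local API only when the connection fails.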
If S-Match is published as a web service, GeoNetwork turns the query into a tree structure with the query analyzer and then sends it to the S-Match web service. S-Match matches the query with WordNet and returns the result to GeoNetwork as a search catalog.
So publishing S-Match as a web service for GeoNetwork is a good approach if regular Internet connectivity is available.
To implement S-Match as a web service and to use it from GeoNetwork, I have redesigned the plug-in architecture, as shown in Figure 23. This architecture is a modification of the previously proposed architecture shown in Figure 19.
First of all we need to wrap S-Match as a web service and then publish this service to the whole world. The service has an XML input/output API to interact with other programs. This API accepts SOAP requests from other programs and responds to them.
To use the S-Match web service, GeoNetwork needs its own XML input/output API to communicate with the web service. This API also does some extra work: it converts the output tree of the Query Analyzer into the content of an XML SOAP body, and finally it builds a complete XML SOAP request and sends it to the web service provider.
Let us see how this architecture works. The semantic search client sends the user's request to the Query Analyzer. The Query Analyzer converts the input request into a tree structure and sends it to the XML input/output API. This API converts the tree into XML format and creates the SOAP message for the request to the S-Match web service provider.
The XML input/output API of the S-Match web service receives the SOAP request and passes the body of the SOAP message, i.e. the request query, to S-Match. S-Match matches it with WordNet and sends the search catalog to its XML input/output API. This API converts the message into an XML SOAP response and sends it back to the XML input/output API of GeoNetwork, which extracts the message body, i.e. the search catalog, and passes it to GeoNetwork. The catalog is then matched with Agrovoc and updated for the different languages. Finally, the search method of GeoNetwork matches the updated catalog against the geo data repository, creates the search result and sends it to the user.
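How the XML input/output API on the GeoNetwork side might wrap the query tree in a SOAP request can be sketched with the Python standard library. The SOAP 1.1 envelope namespace is the standard one, but the element names MatchRequest, QueryTree and Node are hypothetical, since the message format of the S-Match web service is not defined in this thesis.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope


def build_soap_request(node_labels) -> str:
    """Wrap the labels of the query tree in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, "MatchRequest")  # hypothetical element
    tree = ET.SubElement(request, "QueryTree")     # hypothetical element
    for label in node_labels:
        node = ET.SubElement(tree, "Node")         # hypothetical element
        node.text = label
    return ET.tostring(envelope, encoding="unicode")


def extract_node_labels(soap_message: str) -> list:
    """What the receiving API does: read the message body back out."""
    root = ET.fromstring(soap_message)
    body = root.find(f"{{{SOAP_NS}}}Body")
    return [node.text for node in body.iter("Node")]


message = build_soap_request(["forests and mountains"])
print(message)
print(extract_node_labels(message))
```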
6. Conclusion and Future Works
In this thesis I have described GeoNetwork and its structure in detail, together with its current search method and the limitations of that method. Like other GIS software, GeoNetwork currently suffers from the semantic heterogeneity problem. Semantic search can solve this problem: it returns more information for a search and makes the product more accessible. The basic S-Match has also been described in this thesis.
I have also described the integration of semantic search into GeoNetwork and how it can improve the search quality of GeoNetwork. With it, people can retrieve related information through GeoNetwork as well, and thesaurus search also becomes possible. An implementation of the basic S-Match has now been released, and an architecture has been designed for the integration of S-Match into GeoNetwork; I have described it in detail. I have also described where and how to initialize S-Match in GeoNetwork at the implementation level.
Publishing S-Match as a web service is a good approach, although it also has some drawbacks. In this thesis I have outlined how S-Match could be published as a web service; this can be implemented in the future.
The procedure discussed in this thesis for plugging S-Match into GeoNetwork offers an opportunity to improve searching in real-world use.