Arabic Semantic Web Search Engine Computer Science Essay


Since the advent of the Internet, people have increasingly relied on computer-based procedures to make time-consuming and tedious tasks faster and easier. One activity now supported by Internet tools and applications is information search, storage, and retrieval. Computer programmers and scientists continue to develop software, applications, and architectures to improve the search process. The search engine is one Internet tool developed for this purpose.

Search engine users submit queries by typing keywords into a search bar or by filling out an advanced search form, and the search engine responds by retrieving relevant results from its index of stored documents. However, information search remains problematic. To address this, the Semantic Web is being developed, and it is expected to change existing search engines in the future.

We propose a new architecture for an Arabic Semantic Web search tool, SemARAB, which includes a user-friendly graphical user interface (GUI) for Arab users and a semantic ontology. This paper discusses the SemARAB sub-modules, the GUI, the modules used, and the semantic ontology for e-commerce. The proposed model was also tested and compared with other search engines. A further improvement would be to extend the architecture to other domains.


With the Internet and computers now used in almost all human affairs, there is a growing demand for continuous improvement of the tools involved. Computer scientists develop technologies aimed at perfecting Web-based communication, and programmers have devised complex languages and information storage systems for the same purpose. Besides online communication, information search is another essential need served by the Internet. The vast content available on the Web, which causes information overload for users and an information explosion for systems, has led computer experts to develop keyword-based search engines as an Internet research tool.

However, many experts still regard information search as problematic in terms of ease of use and precision. Although keyword-based search engines facilitate information storage, dissemination, and search, concerns remain about their use. Search engines are criticized for being "high recall, low precision" or "low recall, no precision"; for results that are sensitive to vocabulary and limited to general meaning; and for returning results as single pages. Moreover, search results cannot readily be passed to other software for reformatting or editing (Antoniou & Van Harmelen, 2004).

One proposed solution to these problems is Semantic Web technology: a set of initiatives by members of the World Wide Web Consortium that aims to improve the quality of services on the Internet. The work proceeds in layers and focuses on improving the use of explicit metadata, inferential agents, ontologies, and logic (Antoniou & Van Harmelen, 2004).


Semantics, in this context, is the process of enriching the knowledge of a domain or field so that systems become more adaptive, user-friendly, efficient, and intelligent. With expertise in a field of knowledge, software and other applications can be enhanced toward their maximum functionality and performance. Semantic technology is seen as having the potential to produce software that is easier and more efficient to use, and it is envisioned as a breakthrough in software development (Lee, 2004). Semantic Web technology helps reduce information overload by linking stovepipe systems and by improving poor content aggregation, which relieves bottlenecks in the flow of information between systems (Daconta et al., 2003).

According to Orr (2005), Semantic Web technology was conceptualized by Tim Berners-Lee in 1990, when he created the World Wide Web (WWW) of documents that can be linked across the world through complex connections between and within networks. In 1994, Berners-Lee extended the idea from simple file transfer and keyword-based search toward the transmission of meaning, so that the Web could be accessed more efficiently.

This paper proposes an Arabic semantic search tool named SemARAB, a search paradigm that aims to increase the effectiveness of a retrieval system by adding a semantic layer on top of the results returned by the search engine. The model is based on semantic similarity between concepts from a specific ontology and on content-based similarity between resources. The tool addresses the problems of semantic search over Arabic data, Arabic language processing, and the lack of Arabic-language resources annotated with the semantic metadata needed to serve Arabic users.


The Semantic Web is expected to succeed the current World Wide Web. Although it promises greater efficiency and usability for Internet search, its tools and languages are not yet ready to meet the needs of Arabic users. Besides being efficient and user-friendly, a Semantic Web for these users must also handle the complex but widely used Arabic language.

Al Khalifa and Al-Wabil (2007) criticized the existing Semantic Web tools and applications on the following grounds:

lack of Arabic language support in the currently available Semantic Web tools;

lack of existing Arabic Semantic Web applications;

limited support for Arabic research on Semantic Web technology; and

inconsistency in applications and schemes used to encode Arabic scripts.

SemARAB aims to improve Semantic Web search by taking better account of how users store, disseminate, and retrieve information. It provides an easy-to-use interface in which the user specifies the keywords and the type of object needed. SemARAB operates in the following steps:

use the search engine to obtain search results,

filter the results or hits of the query made, and then

rank them based on the concepts ontology to show documents referring to the chosen denotation.

Figure 1 shows the SemARAB sub-module architecture.

[Figure 1 components: User Interface; 1. Web Searching; 2. Data Extraction; 3. Data Filtering; 4. Identify Objects; 5. Ranking Results; Return Results]

Figure 1: SemARAB Sub-module Architecture

The details of the SemARAB architecture are as follows:

First, the Web Search stage uses a general search engine to find the URLs of all documents related to the user's query, based on the user's keywords and the ontology. For example, when searching for "William" as an organization, the query is run in the search engine with the keyword "William" together with all concepts related to organization in the ontology.

The second stage is Data Extraction. Several research groups have focused on the problem of extracting data from HTML documents. In SemARAB, the content of the first 100 results returned by the search engine is extracted, which keeps data extraction efficient.

The extracted data is then filtered. The filter looks for the keywords provided by the user plus all concepts from the related ontology. If multiple matches are found, the document is kept for document ranking. The data filtering procedure supports Arabic language processing for the concept similarity search.

Lastly, SemARAB identifies objects by measuring the similarity between the ontology concepts and the extracted documents. The module then produces the hits and displays them to the user in ranked order, based on the frequency with which ontology concepts occur in the extracted documents.
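
The stages described above can be sketched in a few lines of Python. This is a minimal illustration rather than the SemARAB implementation: the ontology entries, the function names, and the filtering rule (keep a document that contains every keyword and at least one ontology concept) are assumptions made for the example. English placeholder terms stand in for Arabic concepts, and the documents are assumed to be already extracted as plain text.

```python
from collections import Counter

# Hypothetical mini-ontology: area -> related concepts. These English
# placeholders are illustrative only; they are not the paper's Arabic ontology.
ONTOLOGY = {
    "organization": ["company", "store", "branch"],
    "person": ["manager", "customer"],
}


def expand_query(keywords, area, ontology=ONTOLOGY):
    """Web Search stage: user keywords plus the concepts of the chosen area."""
    return list(keywords) + list(ontology[area])


def filter_documents(documents, keywords, concepts):
    """Data Filtering stage (assumed rule): keep a document that contains
    every keyword and at least one ontology concept."""
    kept = []
    for doc in documents:
        tokens = doc.lower().split()  # lower() only matters for the placeholders
        has_keywords = all(k.lower() in tokens for k in keywords)
        has_concept = any(c in tokens for c in concepts)
        if has_keywords and has_concept:
            kept.append(doc)
    return kept


def rank_documents(documents, concepts):
    """Ranking stage: order documents by how often ontology concepts occur."""
    def concept_frequency(doc):
        counts = Counter(doc.lower().split())
        return sum(counts[c] for c in concepts)
    return sorted(documents, key=concept_frequency, reverse=True)


# Usage: search for "william" as an organization.
docs = [
    "william company opens a new store and another store",
    "william plays football",
    "the william company",
]
concepts = ONTOLOGY["organization"]
kept = filter_documents(docs, ["william"], concepts)
for doc in rank_documents(kept, concepts):
    print(doc)
```

The second document is dropped because it mentions no organization concept, and the first document ranks highest because it contains the most concept occurrences.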


Form-based querying is a popular method on websites. It lets the user fill out a form and specify search criteria through an "advanced search" page. The form is then processed to generate a query, which is executed to retrieve the data the user specified. The rationale for option boxes and form-based queries is that finding, retrieving, and delivering results becomes easier, and perhaps faster, because processing is limited to the areas the user selected to narrow the search. Popular search engines such as Yahoo!, Google, and others use form-based queries that let users narrow a search by type (web, images, news, and others) or by geography (for example, US or worldwide).

Figure 2 shows the graphical interface for SemARAB. Through this interface, the user can enter a query in one of five areas of the Arabic e-commerce ontology: persons, sales, operations, organizations, and various. Queries that do not fit the first four areas can be searched under the "various" option.

Figure 2: The SemARAB GUI


Search engines operate on various search processes, software, and modules. SemARAB operates on two important modules: the Data Filtering module and the Identifying Object module. This section discusses in detail how these two modules operate.

Figure 4 shows the components of the Data Filtering and Identifying Object modules.

[Figure 4 labels: Extracted Document; Break into words; Words (Tokens); Remove stopwords; Non-stoplist words; Stemmed words; Object similarity; Weighting Documents; Term weight; Object & Rank. Groups: Data filtering; Identifying Object]

Figure 4: Components of Data Filtering & Object Module

Data Filtering Module

The aim of the Data Filtering module is to measure the similarity of Arabic concepts between the results returned by the search engine for each query and the concepts in the established ontology. The module operates in the following stages:

Tokenization. Each result from the search engine is first tokenized: every sequence of characters with a space before and after it is treated as a token. The frequency of the searched keywords and ontology concepts in each result is then counted. An Arabic word can be classified as a particle (harf), a verb (fe'l), or a noun (esem). However, the most common challenge for keyword-based search is that particles and pronouns attached to the beginning or end of nouns and verbs change the structure of the word dramatically.

Elimination of Stop Words. Stop words are words that occur frequently in Arabic but carry no meaning of importance for the similarity measurement. Removing them makes the search more efficient, narrower, and faster for the user.

Figure 3 below describes the process involved in this stage:

Convert the encoding to Unicode

Remove punctuation (;, ،, /, \, ?, !, ", *), diacritics ( َ ً ُ ٌ ِ ٍ ّ), non-letters, and the prefixes ال, ل, فال, لل

Replace أ , إ , آ with أ , ى with ي, ة with ه

Figure 3: Process of word normalization
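
The tokenization, stop-word, and normalization steps above can be sketched as follows. This is an illustration under stated assumptions, not the SemARAB code: the function names, the four-word stop list, the default source encoding (cp1256), and the rule that a prefix is stripped only when at least three letters remain are all hypothetical. The letter replacements follow Figure 3 as printed. Characters are written as Unicode escapes so that the code points are unambiguous.

```python
import re

# Arabic diacritics (tanween, short vowels, shadda, sukun): U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Anything outside the basic Arabic letter range U+0621..U+064A or whitespace.
NON_LETTERS = re.compile(r"[^\u0621-\u064A\s]")

# Letter replacements as listed in Figure 3.
LETTER_MAP = str.maketrans({
    "\u0625": "\u0623",  # alif with hamza below -> alif with hamza above
    "\u0622": "\u0623",  # alif with madda      -> alif with hamza above
    "\u0649": "\u064A",  # alif maqsura         -> ya
    "\u0629": "\u0647",  # ta marbuta           -> ha
})

# Prefixes from Figure 3, longest first: fal-, al-, lil-, l-.
PREFIXES = ("\u0641\u0627\u0644", "\u0627\u0644", "\u0644\u0644", "\u0644")

# Hypothetical tiny stop list (fi, min, ala, an); a real list is much longer.
STOP_WORDS = {"\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649", "\u0639\u0646"}


def to_unicode(raw, encoding="cp1256"):
    """Decode bytes to a Unicode string; the default encoding is an assumption."""
    return raw.decode(encoding) if isinstance(raw, bytes) else raw


def normalize(text):
    """Drop diacritics, turn non-letters into spaces, unify letter variants."""
    text = DIACRITICS.sub("", text)
    text = NON_LETTERS.sub(" ", text)
    return text.translate(LETTER_MAP)


def strip_prefix(token, min_stem=3):
    """Remove the first matching prefix if at least min_stem letters remain."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_stem:
            return token[len(prefix):]
    return token


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize, split on whitespace, drop stop words, strip prefixes."""
    stops = {normalize(w) for w in stop_words}
    tokens = normalize(to_unicode(text)).split()
    return [strip_prefix(t) for t in tokens if t not in stops]
```

The stop list is passed through the same normalization as the text, so that a stop word and its occurrence in a document still match after the letter replacements.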

Stemming. Arabic is a Semitic language whose basic feature is that most words can be built up from, and analyzed down to, root words, with the exception of common nouns and particles. The morphological analyzer developed by Khoja and Garside (1999) first removes the layers of prefixes and suffixes, then checks a list of patterns and roots to determine whether the remainder could be a known root with a known pattern applied. If so, it returns the identified root. Otherwise, it returns the original word unmodified. The system also removes terms found on a list of 168 Arabic stop words.
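
The root-and-pattern idea can be illustrated with a toy stemmer. The sketch below follows the steps described above (strip affixes, match a pattern, look up the root, otherwise return the word unchanged), but it is not the Khoja and Garside implementation: the affix, pattern, and root lists are tiny hypothetical samples, the function names are invented for the example, and the stop-word list is omitted.

```python
# Hypothetical sample resources; a real stemmer ships far larger lists.
PREFIXES = ("\u0648\u0627\u0644", "\u0627\u0644", "\u0648")  # wal-, al-, wa-
SUFFIXES = ("\u0627\u062A", "\u0648\u0646", "\u0629")        # -at, -un, ta marbuta
PATTERNS = (
    "\u0641\u0627\u0639\u0644",        # fa'il
    "\u0645\u0641\u0639\u0648\u0644",  # maf'ul
    "\u0641\u0639\u0644",              # fa'l (the bare root shape)
)
ROOTS = {"\u0643\u062A\u0628", "\u062F\u0631\u0633"}         # k-t-b, d-r-s

# In a pattern, the letters fa, ain, and lam mark the three root positions.
ROOT_SLOTS = "\u0641\u0639\u0644"


def strip_affixes(word, min_len=3):
    """Remove one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def match_pattern(word, pattern):
    """Return the letters in the root slots if word fits pattern, else None."""
    if len(word) != len(pattern):
        return None
    root = []
    for ch, slot in zip(word, pattern):
        if slot in ROOT_SLOTS:
            root.append(ch)   # root position: collect the letter
        elif ch != slot:
            return None       # fixed pattern letter does not match
    return "".join(root)


def stem(word):
    """Return a known root for word, or the original word unmodified."""
    remainder = strip_affixes(word)
    for pattern in PATTERNS:
        root = match_pattern(remainder, pattern)
        if root in ROOTS:
            return root
    return word
```

With these sample lists, a word such as al-katib (the writer) loses its article, fits the fa'il pattern, and reduces to the root k-t-b, while a word that fits no pattern is returned as it was given.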

Identifying Object Module

This module identifies the objects in the extracted documents that are related to the established concept ontology. It compares the words of each document with the e-commerce ontology concepts, and then stores the ontology concepts found in the document.

The Identifying Object module is divided into the following two stages:

Object Similarity. Cosine similarity is used to measure similarity between the extracted document and the ontology to identify the objects.

Weighting. Weighting is the most important step because it assigns each word a rank that reflects its importance.
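
The two stages can be illustrated with a short sketch. Cosine similarity between a document vector d and an ontology vector o is the dot product of d and o divided by the product of their lengths. The weighting scheme is not specified here, so the example assumes plain relative term frequency; the function names and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def term_weights(tokens):
    """Weight each term by its relative frequency (an assumed scheme)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Usage: compare a preprocessed document with the concepts of one area.
document = term_weights(["laptop", "price", "laptop"])
ontology_area = term_weights(["laptop", "price"])
similarity = cosine_similarity(document, ontology_area)
```

A similarity near 1 means the document uses the area's concepts in similar proportions, and 0 means the document shares no terms with the area.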


SemARAB uses the Google search engine to provide the traditional text search results. The established concept ontology for SemARAB e-commerce is shown in Figure 5. From the five main areas (Organizations, Operations, Sales, Persons, and Various), the ontology branches into more specific areas in order to narrow the search query and to organize the indexing of stored documents so that they are always ready for dissemination and retrieval. The figure shows how the concepts are related and how they can be differentiated.

Figure 5: E-commerce Arabic Ontology

WordNet defines an ontology as a rigorous and exhaustive organization of knowledge in a certain domain that is hierarchical and contains all relevant entities and their relations (WordNet, 2010). In the Semantic Web, an ontology is used to identify the relationships and differences between terms with respect to their meaning (Maedche and Staab, 2001). An ontology offers an organized way to represent concepts, their properties, and the relationships between them for a specific domain in a semantic way.

However, no ontology is currently available for any domain in the Arabic language. We therefore built our own ontology for the e-commerce domain in Arabic by analyzing more than 100 e-commerce web sites to collect the concepts and determine the structure of the ontology.


Users of existing search engines such as Google and Bing submit a query by entering keywords or by filling out a form with the advanced search options. On receiving the query, the search engine responds by matching the entered keywords against its index. The relevant results are then ranked by how frequently the keywords appear in each document, and the ranked documents are presented to the user for retrieval (Diao et al., 2000).

SemARAB, by contrast, uses a concept ontology as an additional semantic layer to improve search relevance and return fewer unrelated results. It provides a GUI divided into areas in which the user constructs queries and explores the results, and which lets the user choose the branch of the ontology to search. Table 1 below shows how well SemARAB and other search engines answer user queries and return relevant results.

Parameter name               |      |      |      |      |      |
Ahmad as a person            | 68 % | 32 % | 35 % | 65 % | 32 % | 68 %
HP laptop as a sales         | 90 % | 10 % | 81 % | 19 % | 80 % | 20 %
AlOthaim as an organization  | 74 % |      | 33 % | 67 % | 33 % | 67 %
1999 SR as a various         |      |      | 37 % | 63 % | 29 % | 71 %

Table 1: SemARAB vs. Web Search Engines

Moreover, SemARAB differs from general search engines in its ability to determine the type of object to search for. Table 1 shows the share of relevant results returned by three web search tools (Google, Bing, and SemARAB) for four test queries. The share of relevant results is higher for SemARAB than for Google and Bing. This is because SemARAB is specialized for the search needs of Arab users.
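
The paper does not state how the percentages in Table 1 were computed. One plausible reading, assumed here, is that each pair gives the share of relevant and non-relevant results among those returned for a query, which is the standard precision measure. The function name below is hypothetical, and the 100-result total is an assumption based on the first 100 results being extracted.

```python
def relevance_split(relevant, retrieved):
    """Return (relevant %, not relevant %) for one query, in whole percents."""
    if retrieved <= 0:
        raise ValueError("retrieved must be positive")
    pct = round(100 * relevant / retrieved)
    return pct, 100 - pct


# Usage: 68 of 100 inspected results judged relevant.
split = relevance_split(68, 100)
```

Under this reading, the pair 68 % and 32 % in the first row would correspond to 68 relevant results out of 100 inspected.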


The Semantic Web promises improvements to search engines as an Internet research tool. However, existing tools and applications do not adequately address the information search and access needs of Arabic users. Contributing factors are the complexity of the Arabic language and inconsistency in the encoding schemes used. We therefore developed a new architecture, SemARAB, that includes a user-friendly GUI with Arabic translation and a semantic ontology covering five areas.

Future improvements to the proposed system stem from the fact that SemARAB is domain-dependent: in its current form, it is restricted to answering queries pertaining to a particular ontology. To operate on a different domain, the user must create a new ontology or add more areas. The same applies to any domain other than Arabic e-commerce.