Ontology Building Web Information Computer Science Essay


Ten years ago, the primary technologies being used to construct large information systems were database systems, information retrieval systems and information filtering systems. Database systems were used to handle large volumes of structured data and to provide guarantees of reliability and consistency despite system failures and high volumes of update transactions. Information retrieval systems were used to search large databases of text, such as scientific abstracts, legal materials, or newspaper stories. Information filtering or "clipping" services provided periodic updates in the form of text stories, mostly in the business domain, based on user profiles. In the relatively short period since, there have been many developments that have affected how information technology is talked about and used.

The most important of these have been the growth of the Internet and the availability of cheap hardware. The technologies for the large information systems discussed today include the Internet, intranets, extranets, Web search, portals, agents, collaborative filtering, XML and metadata, and data mining that uses association rules to build large itemsets as keywords for web pages.

Data by itself does not carry semantic meaning but needs to be interpreted to convey information. Standard data mining algorithms do not 'understand' the data: data are treated as meaningless numbers or attribute values and statistics are calculated on them to build patterns and models, while the interpretation of the results is left to human experts.

It is well known that the performance of data mining methods can be significantly improved if additional relations among the data objects are taken into account: the knowledge discovery process can benefit substantially from domain background knowledge, as successfully exploited in relational data mining and Inductive Logic Programming (ILP). An additional means of providing more information to the learner is to attach semantic descriptors to the data. Moreover, beyond semantic descriptors, the relations in the underlying ontologies can serve as an important additional information source for data mining. This chapter focuses on the use of the Eclat algorithm to generate association rules.

5.2 SEARCH ENGINES

One of the major tools for information access is the search engine. Most search engines use information retrieval techniques to rank Web pages in presumed order of relevance based on a simple query. Compared to the bibliographic information retrieval systems of the seventies and eighties, the new search engines must deal with information that is much more heterogeneous, "messy", more varied in quality, and vastly more distributed or "linked". Similarly, most Web search engines use a centralized architecture where "Web crawlers" gather Web pages and a single, very large index is created. An approach like this has inherent scalability problems.

There has been a growing awareness that effective information retrieval is a hard problem. Indeed, in a recent Turing Award lecture, it was identified as a software "grand challenge". To address this challenge, researchers in information retrieval and related areas of computer science are proposing new retrieval models and techniques to support distributed architectures, summarization, question answering, cross-lingual retrieval, better interfaces, and multimodal search.

5.3 ECLAT BASED ASSOCIATION RULES

Association rule mining finds interesting associations or correlations among items in large data sets. With massive amounts of data continuously being collected and stored, many industries have become interested in mining association rules from their databases. Let D be a set of n transactions such that D = {T1, T2, T3, ..., Tn}, where each Ti ⊆ I and I is a set of items, I = {i1, i2, i3, ..., im}. A subset of I containing k items is called a k-itemset. Let X and Y be two itemsets such that X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An association rule is an implication denoted by X => Y, where X is called the antecedent and Y the consequent. Given an itemset X, the support s(X) is defined as the fraction of transactions Ti ∈ D such that X ⊆ Ti. Let P(X) be the probability of appearance of X in D, and P(Y|X) the conditional probability of appearance of Y given X. P(X) can be estimated as P(X) = s(X). The support of a rule X => Y is defined as s(X => Y) = s(X ∪ Y). An association rule X => Y has a measure of reliability called the confidence, defined as c(X => Y) = s(X => Y)/s(X). Confidence can be used to estimate P(Y|X): P(Y|X) = P(X ∪ Y)/P(X) = c(X => Y).
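To make these definitions concrete, here is a minimal Python sketch of support and confidence; the function names and the toy database are illustrative additions, not part of the proposed system.

def support(itemset, transactions):
    # Fraction of transactions Ti in D that contain every item of the itemset.
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # c(X => Y) = s(X U Y) / s(X), an estimate of P(Y|X).
    x = frozenset(antecedent)
    return support(x | frozenset(consequent), transactions) / support(x, transactions)

# Toy database D = {T1, ..., T4}
D = [frozenset(t) for t in ({"b", "c", "e"}, {"b", "c"}, {"a", "b"}, {"b", "e"})]
print(support({"b", "c"}, D))       # s({b,c}) = 2/4 = 0.5
print(confidence({"c"}, {"b"}, D))  # c(c => b) = 0.5 / 0.5 = 1.0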

5.3.1 ECLAT ALGORITHM

Many algorithms have emerged for finding the interesting relationships from which association rules are generated. Among them, Eclat is an efficient technique for searching itemsets.

The Eclat algorithm is basically a depth-first search algorithm using set intersection. It uses a vertical database layout: instead of explicitly listing all transactions, each item is stored together with its cover (also called its tidlist), and an intersection-based approach is used to compute the support of an itemset. In this way, the support of an itemset X can be computed by simply intersecting the covers of any two subsets Y, Z ⊆ X such that Y ∪ Z = X. In other words, when the database is stored in the vertical layout, the support of a set can be counted much more easily by intersecting the covers of two of its subsets that together give the set itself.
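The vertical layout can be illustrated with a small Python sketch (the variable names and the three-transaction database are assumptions made here for illustration):

from collections import defaultdict

# Horizontal layout: one row per transaction (TID -> items).
horizontal = {1: {"b", "c", "e"}, 2: {"b", "c", "d"}, 3: {"a", "b"}}

# Vertical layout: one cover (tidlist) per item.
cover = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        cover[item].add(tid)

# The support of {b, c} is just the size of the intersection of the two covers.
print(cover["b"] & cover["c"])       # {1, 2}
print(len(cover["b"] & cover["c"]))  # support count = 2

The Eclat algorithm is given below.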

Input: D, K, prefix I ⊆ Items
Output: F[I](D, K)
1: F[I] := {}
2: for all i ∈ Items occurring in D do
3:   F[I] := F[I] ∪ {I ∪ {i}}
4:   // Create Di
5:   Di := {}
6:   for all j ∈ Items occurring in D such that j > i do
7:     C := cover({i}) ∩ cover({j})
8:     if |C| >= K then
9:       Di := Di ∪ {(j, C)}
10:    end if
11:  end for
12:  // Depth-first recursion
13:  Compute F[I ∪ {i}](Di, K)
14:  F[I] := F[I] ∪ F[I ∪ {i}]
15: end for

Description

In this algorithm each frequent item is added to the output set. Then, for every such frequent item i, the i-projected database Di is created. This is done by first finding every item j that frequently occurs together with i. The support of the set {i, j} is computed by intersecting the covers of both items. If {i, j} is frequent, then j is inserted into Di together with its cover. Reordering is performed at every recursion step of the algorithm, between line 10 and line 11. The algorithm is then called recursively to find all frequent itemsets in the new database Di. All items in the database are reordered in ascending order of support to reduce the number of candidate itemsets that are generated, and hence to reduce the number of intersections that need to be computed and the total size of the covers of all generated itemsets. Since the algorithm does not fully exploit the monotonicity property, but generates a candidate itemset based on only two of its subsets, the number of candidate itemsets generated is much larger than in a breadth-first approach such as Apriori. In effect, Eclat generates candidate itemsets using only the join step of Apriori, since the itemsets needed for the prune step are not available.
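As a concrete, hedged rendering of this recursion, the following Python sketch implements cover-intersection Eclat with the ascending-support reordering described above; the names (eclat, mine, min_count) are introduced here and not taken from the source.

def eclat(prefix, items, min_count, frequent):
    # items: list of (item, cover) pairs, each already frequent;
    # prefix: the itemset shared by everything on this branch.
    items = sorted(items, key=lambda p: len(p[1]))  # ascending-support reordering
    for pos, (item, cov) in enumerate(items):
        new_prefix = prefix | {item}
        frequent[frozenset(new_prefix)] = len(cov)
        projected = []                               # the i-projected database Di
        for other, other_cov in items[pos + 1:]:
            c = cov & other_cov                      # intersect the two covers
            if len(c) >= min_count:
                projected.append((other, c))
        if projected:                                # depth-first recursion
            eclat(new_prefix, projected, min_count, frequent)

def mine(transactions, min_count):
    cover = {}                                       # build the vertical layout
    for tid, items in transactions.items():
        for item in items:
            cover.setdefault(item, set()).add(tid)
    start = [(i, c) for i, c in cover.items() if len(c) >= min_count]
    frequent = {}
    eclat(set(), start, min_count, frequent)
    return frequent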

A technique that is regularly used is to reorder the items in ascending order of support, which reduces the number of candidate itemsets generated. In Eclat, such reordering can be performed at every recursion step of the algorithm. Note also that at a certain depth d, the covers of at most all k-itemsets with the same (k − 1)-prefix are stored in main memory, with k ≤ d. Because of the item reordering, this number is kept small. Tables 5.1 to 5.3 illustrate itemset generation on a sample database.

Table 5.1: Item abbreviations of the database ETDB

Item | Book Title
-----+----------------------
A    | System Programming
B    | XML
C    | WML
D    | Distributed Computing
E    | Neural Network
F    | Data Mining

Table 5.2: A transaction database

TID | Items (Books)
----+---------------
1   | B, C, E
2   | B, C, D, E
3   | A, B, C, D, E
4   | B, C, D
5   | A, B, F
6   | A, B, C, E

Table 5.3: Large itemsets with minsup = 33% (2 of 6 transactions)

Support  | Itemsets                                         | No.
---------+--------------------------------------------------+----
6 = 100% | B                                                | 1
5 = 83%  | C, BC                                            | 2
4 = 67%  | E, BE, CE, BCE                                   | 3
3 = 50%  | A, D, AB, BD, CD, BCD                            | 6
2 = 33%  | AC, AE, DE, ABC, ABE, ACE, BDE, CDE, ABCE, BCDE  | 10
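Under the stated assumptions, the mine sketch given earlier reproduces Table 5.3 when run on the transactions of Table 5.2 with an absolute support threshold of 2:

# Transactions from Table 5.2 (A..F abbreviate the book titles of Table 5.1).
etdb = {
    1: {"B", "C", "E"},
    2: {"B", "C", "D", "E"},
    3: {"A", "B", "C", "D", "E"},
    4: {"B", "C", "D"},
    5: {"A", "B", "F"},
    6: {"A", "B", "C", "E"},
}
frequent = mine(etdb, min_count=2)   # minsup = 33% of 6 transactions = 2
for itemset, count in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print("".join(sorted(itemset)), count)
# prints B 6, then C and BC with 5, ..., down to ABCE and BCDE with 2 (Table 5.3)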

5.4 ONTOLOGY AND ECLAT BASED RULE MINING ALGORITHMS

Ontology is used for knowledge sharing and reuse. It improves information organization, management and understanding. Ontology plays a significant role in areas dealing with vast amounts of distributed and heterogeneous computer-based information, such as the World Wide Web, intranet information systems, and electronic commerce.

A "conceptualization" is an abstract model of a phenomenon, created by identification of the relevant concepts of the phenomenon. The concepts, the relations between them and the constraints on their use are explicitly defined. "Formal" means that Ontology is machine-readable and excludes the use of natural languages. For example, in medical domains, the concepts are diseases and symptoms, the relations between them a recausal and a constraint is that a disease cannot cause itself. That Ontology is a "shared conceptualization" states that Ontologies aim to represent consensual knowledge intended for the use of a group. Ideally the Ontology captures knowledge independently of its use and in a way that can be shared universally, but practically different tasks and uses call for different representations of the knowledge in ontology. Ontology is sometimes confused with taxonomy, which is a classification of the data in a domain. The difference between them is in two important contexts:

1. Ontology has a richer internal structure as it includes relations and constraints between the concepts.

2. Ontology claims to represent a certain consensus about the knowledge in the domain; this consensus is among the intended users of the knowledge.

Association rule mining searches for interesting correlations among items in a given data set. It was originally proposed almost a decade ago and has since attracted enormous attention in both academia and industry.

5.4.1 PROPOSED ONTOLOGY ALGORITHMS

The proposed work designs a Web Information Retrieval (WIR) system, an ontology-based adaptive web information retrieval system. Websites have traditionally been designed to fit the needs of a generic user; the adaptive WIR system instead uses association rules to cluster user keywords, and the same algorithm is used to classify the knowledge-base content. To build the ontology search engine, adaptive personalization for the user interests of the search engine must be designed.

[Figure 5.1: The general architecture of the Onto Server — user keyword concepts pass through a taxonomy/thesaurus based on association rules to the Onto Server, which draws on the ontology library to find the relevant ontology and feeds the Onto final ranking system.]

The proposed system needs a conceptual model to develop the ontology-based conceptual WIR system with a concept hierarchy for user interests; the general architecture is shown in Figure 5.1 and the WIR model in Figure 5.2.

[Figure 5.2: The general steps of the WIR system — the user's input keywords are matched against stored pages; the onto-mining server searches the database for relevant pages and returns results to the user as a taxonomy view built with association rules; the chosen pages, viewed as RDF plus metadata, and the resulting ontology are saved to the mining library.]

To build this system, the system gathers information about each user and analyzes it. From that point on, when the user visits the website the system analyzes his or her activity, storing tracks in cookies, and produces the user profile. At this level the system needs contextual information for each user, which means the proposal requires three processing steps:

1. Generating the history of each user, to identify the most relevant web pages for that user based on past histories.

2. Classifying the concepts as a tree of concepts and user activity.

3. Classifying the user profile.

The general algorithm of the designed WIR system is as follows:

Ontology WIR Algorithm

Begin

Step 1: Call the Web Information Retrieval algorithm.

Step 2: For each user, create a personal tree featuring his/her own concepts and hierarchy.

Step 3: Gather sample documents for each concept in the hierarchy.

Step 4: Use the Eclat algorithm to classify, comparing the user's sample documents against the user's interests.

Step 5: Present the user with websites organized using large itemsets as the user's concepts (see the sketch after this algorithm).

Step 6: End
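Step 5 is only sketched in the source, so the following Python fragment is a hypothetical illustration of how pages might be filed under a user's concepts once each concept carries a large keyword itemset mined by Eclat; the function and the example itemsets are assumptions.

def assign_page(page_keywords, concept_itemsets):
    # File a page under the concept whose mined itemset it overlaps most.
    best, best_overlap = None, 0
    for concept, itemset in concept_itemsets.items():
        overlap = len(page_keywords & itemset)
        if overlap > best_overlap:
            best, best_overlap = concept, overlap
    return best

concept_itemsets = {             # illustrative: concept -> large keyword itemset
    "markup":    {"B", "C"},     # XML, WML
    "computing": {"D", "E"},     # Distributed Computing, Neural Network
}
print(assign_page({"B", "C", "E"}, concept_itemsets))  # -> "markup"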

Web Information Retrieval (WIR) Algorithm

Begin

Step 1: Generate a user profile, where each visitor has:

• an individual file and a user ID;

• the visitor's interests, with an ID assigned to each interest, which means each subject has:

1. an ID for the user's documents

2. an ID for the subject itself

Step 2: Build a database for each category, which means each category has:

• an ID for each user who has an interest in this category;

• an ID for the interest (category) itself.

This produces a document vector (holding personal information and the user profile).

Step 3: Generate a super-document: for any subject with many visitors, the profiles are summed and the result is given a profile ID (a hypothetical sketch follows the algorithm).

Step 4: Perform classification using association rules: compare the results and, for each subject, generate the top and the leaves using the concept hierarchy algorithm.

Step 5: The next time a visitor enters the website, the proposed system builds a Conceptual Profile (CP) for him/her, based on the classified user activity and the documents present in the database.

Step 6: The Onto Server takes the CP and the user request, ranks the results to obtain an original rank, and then re-ranks by combining the original rank with the similarity between the keywords and the CP.

Step 7: Rate the web pages and decide which pages the system should recommend based on user interest.

End.
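Steps 1 to 3 describe a data model rather than give one, so the following is a hypothetical Python sketch of the IDs and the super-document of Step 3; the class and field names are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class UserProfile:
    user_id: int
    # interest (category) ID -> IDs of the user's documents for that interest
    interests: dict = field(default_factory=dict)

def super_document(profiles, interest_id):
    # Step 3: merge the document sets of all users sharing one interest.
    docs = set()
    for p in profiles:
        docs.update(p.interests.get(interest_id, []))
    return docs

alice = UserProfile(user_id=1, interests={10: [101, 102]})
bob = UserProfile(user_id=2, interests={10: [102, 103]})
print(super_document([alice, bob], interest_id=10))  # {101, 102, 103}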

In this algorithm, the proposed system first generates an ID for each user who visits the website and stores his or her activities on the site. Second, the proposed system builds a database for the concepts the user has used; this database holds an ID for each subject and IDs for all documents belonging to that concept, serving as a thesaurus for the website. After this point, the proposed system classifies the user's concepts using the concept hierarchy algorithm. When the user requests anything from the website, the Onto Server takes the request and rates the web pages related to his or her interests, as well as the pages recommended by the Onto Server, using the response equation:

Server Response = α * OntoScore + (1 − α) * RequestScore    ... eq. (1)

Here α has a value between 0 and 1. When α is 0, the conceptual rank is given no weight and the result is equivalent to pure request-based ranking. When α is 1, request-based ranking is ignored and only the conceptual rank is considered. The two rankings can be blended by varying the value of α.
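A minimal sketch of eq. (1), assuming both scores are already normalized to [0, 1] (the function name is introduced here):

def server_response(onto_score, request_score, alpha=0.5):
    # Eq. (1): blend conceptual rank and request-based rank with alpha.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * onto_score + (1 - alpha) * request_score

print(server_response(0.9, 0.4, alpha=0.0))  # 0.4: pure request-based ranking
print(server_response(0.9, 0.4, alpha=1.0))  # 0.9: pure conceptual ranking
print(server_response(0.9, 0.4, alpha=0.5))  # 0.65: blended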

The proposed system was created in PHP with a MySQL relational database. It includes a shortcuts table: when the user searches for a specific word that does not exist in any of the sites stored in the sites table, the system consults the shortcuts table, which may contain subjects related to the entered word, and returns the contents of those related subjects, which may interest the user as much as the original subject would. The needs analysis for the Onto-WIR system, carried out before development of the search engine, showed that it would be of most use at a personal or departmental level, answering the needs of special-interest groups or communities.

The prototype chosen for development had very specific characteristics and was much more manageable than a structure of specialized subject gateways would have been, such as those used by general gateways like Yahoo. The first objective of the design was the simplest possible search system, which would reduce the problem of over-complicated interfaces for novice users, reply faster, return information in greater detail, and be more agile. Figure 5.3 illustrates the database architecture.

[Figure 5.3: MySQL relational database — the client (a web GUI navigator rendering HTML) talks to an HTTP server running PHP, which queries a MySQL database holding the tables data, types, users, dk (data-keyword relations), keywords and pma_relation.]

The resulting relational scheme consists of six tables: data, types, users, dk (data-keyword relations), keywords and pma_relation (Figure 5.3 above). This last table proves essential when working with this version of MySQL, as it enables the database management system to interpret relations with many-to-many cardinality. The users table remains apart from the scheme and is not related to the other tables, as its only task is to administer the various types of users who access the application.
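The column definitions below are a guess at this scheme, shown with Python's built-in sqlite3 as a stand-in for MySQL; only the six table names come from the source, and everything else is an assumption.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE types    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT, role TEXT);
CREATE TABLE data     (id INTEGER PRIMARY KEY, title TEXT,
                       type_id INTEGER REFERENCES types(id));
CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT);
-- dk resolves the many-to-many link between data and keywords
CREATE TABLE dk       (data_id    INTEGER REFERENCES data(id),
                       keyword_id INTEGER REFERENCES keywords(id));
-- pma_relation: bookkeeping that lets the DBMS interpret those links
CREATE TABLE pma_relation (master_table TEXT, foreign_table TEXT, via TEXT);
""")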

5.5 SUMMARY

The Onto-mining search engine in this research allows the user to perform keyword searches on certain types of "ontology" files and to inspect the files visually to check their relevance. This research designed the Web Information Retrieval (WIR) system by building a web search engine based on the Eclat procedure and on user interests, leading to more efficient website design and to website personalization that suits user needs.
